
Universität Stuttgart

Institut für maschinelle Sprachverarbeitung

Azenbergstraße 12

D - 70174 Stuttgart

Diploma Thesis No. 77

Null Subjects in Statistical Machine Translation:

A Case Study on Aligning English and Italian Verb Phrases

with Pronominal Subjects

Supervisor: Dr. Alexander Fraser

First examiner: Dr. Helmut Schmid

Second examiner: apl. Prof. Dr. Ulrich Heid

Author: Anita Gojun

Registered: 1 June 2010

Submitted: 30 August 2010

Abstract

In this thesis, I present a method for aligning English and Italian parallel verb phrases which have pronominal subjects. The phrases contain the pronominal subject, the verbal elements of a verb phrase (VP) and the negation. I use English parse trees and part of speech tagged Italian sentences. The process of aligning parallel phrases consists of several steps. An Italian sentence is searched in order to find all Italian VPs. In the parallel English sentence, the clauses with pronominal subjects are detected. The base word alignment (created by GIZA++) of the elements of an English VP is used to identify the matching Italian VP. The alignment of parallel phrases is computed by applying alignment rules which define the alignment between words with a specific part of speech tag.

The rule-based VP alignment reaches an f-score of 81%, whereas the f-score of the base word alignment is 64%. The rules compute correct alignments for most parallel VPs. However, they produce erroneous alignments if false parallel phrases are identified. This is the case when the English VP is not translated, or when it corresponds to an Italian phrase of an arbitrary type (e.g. a prepositional phrase). These cases are analyzed and a few experiments are carried out in order to solve these problems. They lead to higher recall (the best recall is 84%), but lower precision.

I use the rule-based word alignment to build phrase-based SMT systems with Moses and to examine whether improved word alignment of English pronominal subjects leads to better results when pronominal subjects are translated between Italian, a null subject language, and English, a non-null subject language. SMT systems built using the rule-based VP alignment receive lower BLEU scores even though the translations are comparable with the translations generated by SMT systems which are built using the base alignment. In the translation direction EN → IT, the SMT system built using the base alignment has a BLEU score of 19.15, and the SMT system built using the rule-based VP alignment has a BLEU score of 18.18. In the opposite translation direction, the SMT system built using the base alignment has a BLEU score of 22.07, whereas the SMT system built using the rule-based VP alignment has a BLEU score of 21.81. The systems perform equally with respect to the translation of pronominal subjects, which means that the improved VP alignment does not lead to an improvement of subject pronoun translation between English and Italian.

The analysis of translations of example sentences will show that pronoun resolution and syntactic analysis of both languages are necessary to ensure the correct generation of the corresponding subject pronoun. Furthermore, when English pronouns are translated into Italian, the decision must be made as to whether the Italian subject pronoun should be overtly expressed.

I hereby declare that I wrote this thesis independently and used no literature other than that cited. All quotations and paraphrased borrowings are marked as such, with the exact source given.

Contents

1 Introduction
  1.1 Motivation
  1.2 Methodology
  1.3 Outline

2 Pro-drop and Null Subject Languages
  2.1 Pro-drop theories
    2.1.1 Rich inflection morphology
    2.1.2 Zero topic theory
  2.2 Null subjects and syntax
    2.2.1 Null subjects and English syntax
    2.2.2 Null subjects and Italian syntax
  2.3 Null subjects and pragmatics
  2.4 Statistics on null subjects in Italian
  2.5 Summary

3 Pro-drop in machine translation
  3.1 Previous work on zero pronouns in MT
  3.2 Translation between English and Italian
    3.2.1 Italian to English
    3.2.2 English to Italian
  3.3 Summary

4 Statistical machine translation
  4.1 Word alignment
  4.2 Phrase-based SMT

5 Word alignment of English and Italian verb phrases
  5.1 Motivation
  5.2 Data preparation
    5.2.1 English
    5.2.2 Italian
    5.2.3 Data preprocessing errors
  5.3 Applying alignment rules
    5.3.1 Identification of Italian VPs
    5.3.2 Identification of the most probable Italian VP
  5.4 Alignment rules
    5.4.1 Syntax of the English and Italian VPs
    5.4.2 Subject pronouns
    5.4.3 Finite verbs
    5.4.4 Participles, infinitives and gerundives
    5.4.5 Negation
    5.4.6 Infinitival particle
    5.4.7 Alignment examples
  5.5 Evaluation
    5.5.1 Precision, Recall, F-score
    5.5.2 Error analysis
  5.6 System extensions
    5.6.1 Lexical search for the matching Italian VP
    5.6.2 Retaining the base alignment
  5.7 Summary

6 Evaluation of SMT systems
  6.1 The BLEU score
  6.2 Evaluation of SMT systems
  6.3 Error analysis
  6.4 Adequate training data

7 Conclusion
  7.1 Summary
  7.2 Future work

A Italian tag set

B English tag set (Penn Treebank Tagset)

C English subject pronoun occurrences

List of Tables

List of Figures

References

1 Introduction

In my diploma thesis, I addressed the problem of pro-drop in statistical machine translation using the language pair English - Italian. I carried out a linguistic analysis of the phenomenon with respect to machine translation, and I developed rules based on part of speech tags which define the word alignment of the English subject pronoun and its verb phrase with elements of the corresponding Italian verb phrase. I examined the generated translations as well as the translation parameters to find an explanation for the deficient translation of pronominal subjects between English and Italian.

1.1 Motivation

English is a language in which the subject position must always be occupied. In Italian, this is not the case. When the Italian subject is expressed by a pronoun, it can be dropped. This means that the English pronominal subject does not necessarily have a pronominal counterpart in Italian. In the context of (statistical) machine translation, this leads to problems in both translation directions, as well as within the automatic word alignment task. The questions that arise concerning pronominal subjects are:

(Q1) When word alignment of parallel sentences is carried out, with which Italian word should an English subject pronoun be aligned when the subject pronoun in Italian is omitted?

(Q2) How can we make sure that the correct pronoun is generated when translating an Italian null subject into English?

(Q3) How can we decide when to generate a null pronoun when translating an English subject pronoun into Italian?

The theoretical discussion of the problem of pro-drop within (statistical) machine translation will show which information can be used to solve the problems formulated in (Q2) and (Q3). In the practical part of the work, I will present a method which handles question (Q1). Improved word alignment of English pronominal subjects does not solve the problems in (Q2) and (Q3). Translations of example sentences will be analyzed thoroughly in order to explain why the improved word alignment does not contribute to the translation of pronominal subjects between English and Italian.

1.2 Methodology

This work concentrates on the improvement of the word alignment of English and Italian verb phrases containing a subject pronoun (cf. question (Q1) in the preceding section). Since English subject pronouns do not always have Italian counterparts, they are often aligned incorrectly. I therefore develop a set of rules which define the alignment of English subject pronouns (cf. section 5.4). Since Italian verbs correspond to English phrases containing the subject pronoun and verbs, the alignment rules compute the alignment of entire English and Italian parallel verb phrases (VPs). The alignment rules define only the alignment of the verbal elements of the VPs and of negation. Therefore, I use the term VP to denote the part of a verb phrase which contains only verbal elements and negation. The other elements of verb phrases are not handled within this work. I make three important assumptions for the computation of the VP alignment:

(A1) Each English VP which has a pronominal subject has a parallel Italian VP,

(A2) The base alignment¹ is correct enough to allow the identification of English and Italian parallel VPs,

(A3) English and Italian parallel phrases have parallel part of speech sequences.

¹ Base word alignment is created by GIZA++ (cf. chapter 4.1).

I use English parse trees and Italian part of speech tagged sentences (an Italian parser was not available). The program for the computation of the VP alignment is applied to English and Italian phrase pairs in which the English phrase has a pronominal subject (cf. section 5.3). To ensure that the Italian VPs are correct, i.e. that they contain only verbal elements, I first identify all VPs in an Italian sentence by searching for PoS sequences which build a VP (cf. section 5.3.1). The parallel Italian VP is then identified on the basis of the base alignment (cf. section 5.3.2). The alignment rules compute alignments for the matching part of speech tags of the phrase pair elements (cf. section 5.4). All links in the base alignment for aligned phrase pairs are removed. Then, the alignment computed for the VP pairs is integrated into the base word alignment. The resulting word alignment of a sentence pair does not have any base alignment links for the phrase pairs which are handled by the alignment rules.

I evaluate the VP alignment by computing precision, recall and f-score (cf. section 5.5.1). I created a gold alignment manually by defining the alignment only of relevant English phrases. The alignments of other tokens in a sentence were ignored in the evaluation. The rule-based VP alignment outperforms the base alignment. Expressed in f-score, the rule-based VP alignment achieves an improvement of 17 percentage points (f-score = 81%).

The assumptions (A1) and (A2) do not always hold, which leads to false alignments. Not every English VP has a parallel Italian VP, which contradicts (A1). Sometimes, the phrases are not translated, or they correspond to other Italian phrases (prepositional phrases, participles, etc.). Since the alignment rules are defined only for PoS sequences of English and Italian VPs (cf. assumption (A3)), they compute false alignments in such cases. The assumption (A2) can lead to the identification of false phrase pairs since the base alignment is not error-free (cf. section 5.5.2). I show some experiments which were carried out in order to solve these problems (cf. section 5.6). For example, to deal with problems with respect to (A1), the base alignment can be retained if the parallel Italian VP cannot be identified. In general, the experiments that I carried out lead to higher recall but lower precision.

The base alignment and the rule-based VP alignment are used to build statistical machine translation (SMT) systems for both translation directions. The quality of the generated translations is given in BLEU scores. The rule-based VP alignment leads to lower BLEU scores, but the manual analysis of the generated translations revealed that the translations are nearly the same (cf. section 6.2). This leads to the conclusion that the improved VP alignment does not contribute to the translation of pronominal subjects between English and Italian. The discussion of the translation probabilities of the relevant phrases will show that phrase-based SMT is not an appropriate machine translation approach for subject pronoun translation since it does not have access to the context (preceding sentences) of the input sentence. When translating null subjects into English subject pronouns, in many cases, the characteristics of the omitted pronoun (number, gender, person) can be derived from the inflected verbs (cf. section 3.2.1). But, in general, pronoun resolution (using the preceding sentences) is needed to ensure the generation of the correct English pronoun (cf. question (Q2) in the previous section). Furthermore, only the syntactic analysis of the Italian input can provide clear information on whether the Italian sentence has an (omitted) pronominal subject or an NP subject. When translating English pronominal subjects into Italian, the decision must be made as to whether the Italian subject pronoun has to be expressed overtly (cf. question (Q3) in the previous section). The data observation revealed that some words (adjectives) often occur with overtly expressed Italian subject pronouns (cf. section 3.2.2).
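The integration and evaluation steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the thesis: alignments are assumed to be sets of (English index, Italian index) pairs, a base link is assumed to be removed when it touches either side of the VP pair, and the function names, the toy sentence pair and its links are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (English token index, Italian token index)


def splice_vp_alignment(base: Set[Link], en_vp: Set[int], it_vp: Set[int],
                        rule_links: Set[Link]) -> Set[Link]:
    """Remove base links of an aligned VP pair and add the rule-based links."""
    # Assumption: a base link is dropped if it touches either side of the pair.
    kept = {(e, i) for (e, i) in base if e not in en_vp and i not in it_vp}
    return kept | rule_links


def precision_recall_f(predicted: Set[Link], gold: Set[Link],
                       relevant_en: Set[int]) -> Tuple[float, float, float]:
    """Score only links whose English token belongs to a relevant phrase."""
    pred = {link for link in predicted if link[0] in relevant_en}
    ref = {link for link in gold if link[0] in relevant_en}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical toy pair: "I do not sleep ." / "Non dormo ."
en_vp = {0, 1, 2, 3}                             # I, do, not, sleep
it_vp = {0, 1}                                   # Non, dormo
base = {(0, 0), (2, 0), (3, 1), (4, 2)}          # invented base links
rule_links = {(0, 1), (1, 1), (2, 0), (3, 1)}    # invented rule output
gold = set(rule_links)                           # invented gold links

spliced = splice_vp_alignment(base, en_vp, it_vp, rule_links)
print(precision_recall_f(base, gold, en_vp))
print(precision_recall_f(spliced, gold, en_vp))
```

Restricting both the predicted and the gold links to the relevant English tokens mirrors the statement that alignments of other tokens were ignored in the evaluation.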

1.3 Outline

This work is organized as follows: in Chapter 2, I introduce the phenomenon of pro-drop. Two theories are briefly presented which mention a number of linguistic characteristics which allow or prohibit pro-drop. In Chapter 3, pro-drop is discussed with respect to machine translation. The features of English and Italian which could simplify the generation of correct subject pronouns are identified. In Chapter 4, the characteristics of phrase-based statistical machine translation are described. Chapter 5 contains a detailed description of the rules and of the program for computing word alignment between English and Italian verb phrases. The evaluation results of the VP alignment rules are presented and the most common errors are discussed. In Chapter 6, the evaluation of the SMT systems is carried out. I report BLEU/NIST scores and take a closer look at generated translations and translation parameters in order to find an explanation for false pronoun translations. Finally, in Chapter 7, the findings of the work are summarized and future work is outlined.


2 Pro-drop and Null Subject Languages

In this chapter, I introduce the terms pro-drop and null subject language and present two theories which explain why some languages are able to omit subject (and object) pronouns (cf. section 2.1). In section 2.2, I present syntactic constructions in English and Italian in which null subjects can occur, whereas in section 2.3, the functions which overtly expressed Italian subject pronouns fulfill are discussed. Statistics about null subjects in Italian are presented in section 2.4.

Consider a simple sentence in English with a subject (SUBJ) and a verbal predicate (VPRED) as shown in (1).

(1) He_SUBJ sleeps_VPRED.

The German translation of the sentence in (1) is shown in (2). If we compare the syntax of these two sentences, we see that both of them have the same sentence elements: a subject and a verbal predicate.

(2) Er_SUBJ  schläft_VPRED.
    he       sleeps
    'He sleeps.'

Let us now take a look at Italian and Croatian sentences which are equivalent to the German and English sentences in the previous examples.

(3) a. Egli_SUBJ  dorme_VPRED.
       he         sleeps
       'He sleeps.'

    b. Dorme_VPRED.
       sleeps
       'He/she/it sleeps.'

(4) a. On_SUBJ  spava_VPRED.
       he       sleeps
       'He sleeps.'

    b. Spava_VPRED.
       sleeps
       'He/she sleeps.'

The Italian sentences in (3) are both correct translations of the English and German sentences above. But there is one important difference between them: the sentence in (3a) contains a subject and a predicate, whereas the sentence in (3b) has only a verbal predicate. While Italian and, for example, Croatian (cf. examples in (4)) grammars allow for omission of the subject pronoun, English and German grammars require subjects to be overtly expressed. Languages such as Italian, Croatian and Spanish are able to omit only subject pronouns. Thus, they are called null subject languages (NSLs). Many Romance (e.g. Italian, Spanish, Portuguese) and Slavic (e.g. Croatian, Czech, Polish) languages belong to this group.

There are also languages, such as Chinese, which allow for omission of both subject and object pronouns. These are called pro-drop languages. The set of NSLs is a subset of the pro-drop languages.

Examples (1) and (2) show grammatically correct sentences of English and German. However, they would become ungrammatical if the subject pronouns were omitted. In these languages, pronoun dropping (pro-drop) is not allowed. English and German are neither pro-drop languages nor NSLs.

Let us now take a look at the following German sentences.

(5) Er  sagte,  dass  ∅  gefeiert    wurde.
    He  said,   that     celebrated  has been.
    'He said that there was a celebration.'

(6) Heute  ∅  wird     gefeiert.
    Today     will be  celebrated.
    'Today, there will be a celebration.'

The dass-sentence (corresponding to the English that-sentence) in (5) does not contain a subject. However, the sentence is grammatically correct. In German, there are a few constructions which allow the expletive to be dropped, so German can be called a semi-NSL. The German examples show that in some cases it is not simple to say whether a language is an NSL or not. Some languages, like modern Hebrew and the Scandinavian languages, do not generally allow zero subject pronouns; in a number of constructions, however, subject pronouns can be omitted [Haegeman, 96].

2.1 Pro-drop theories

In the following, I briefly introduce two theories that try to explain why some languages are able to omit subject and/or object pronouns while others are not. The theories account for the omission of both subject and object pronouns. In the further discussion, though, only the omission of subjects will be considered, since this work concentrates on the problem of translating subject pronouns between an NSL and a non-NSL.

2.1.1 Rich inflection morphology

It is widely accepted that the possibility of pro-drop often correlates with the existence of a rich inflectional morphology (verb-subject, verb-object agreement). The agreement marking on a verb has to be rich enough to determine, or to allow the recovery of, the content (reference) of a missing pronoun [Huang, 84]. The Italian example sentence in (7) should clarify this thesis.

(7) Leggo  un  libro.
    read   a   book.
    'I read a book.'

Although the subject pronoun in (7) is not phonetically realised, its content has to be determined. To achieve this, [Huang, 84] proposes the co-indexing of the missing pronoun with the closest nominal element. In our example sentence, this is the Agr (Agreement) of the verb leggo. The verb in (7) clearly defines the person and number of the missing subject: 1st person singular.

Let us take a look at a literal translation of (7) into English.

(8) * Read a book.

The English verb read in (8) cannot unambiguously define the content of the missing subject pronoun. It is ambiguous and could be combined with the 1st and 2nd person singular and plural and with the 3rd person plural. So we need a lexical element (pronoun) to identify the number and person of the subject.

According to this theory, pro-drop languages are also able to omit objects if they have verb-object agreement. Since Italian and English do not exhibit any verb-object agreement, object pronouns cannot be dropped.

In languages like German (cf. examples (5) and (6)), which have some constructions that allow the omission of subject pronouns, there is one restriction: subject pronouns can be realized as null subjects only if they are non-referential. [Haegeman, 96] explains this by the fact that German inflection is richer than English inflection but poorer than Italian inflection. The inflection may license null subjects in German, but the verb agreement does not enable us to identify a referent for a null subject pronoun.

The theory about morphological richness and pro-drop holds for many languages, but there is a group of languages like Chinese or Japanese which have no agreement morphology at all, but still allow for pro-drop. In the next section, I discuss one theory that tries to explain the ability of pro-drop in the mentioned languages.
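The identification argument of this section can be made concrete with a small sketch. It assumes that a verb form identifies a missing subject exactly when it fills a single person/number cell of its paradigm. The present-tense paradigms of leggere and read are standard, but the function names and the encoding are illustrative and not part of [Huang, 84].

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, str]  # (person, number)

# Present-tense paradigms of Italian "leggere" and English "read".
ITALIAN_LEGGERE: Dict[Cell, str] = {
    (1, "sg"): "leggo", (2, "sg"): "leggi", (3, "sg"): "legge",
    (1, "pl"): "leggiamo", (2, "pl"): "leggete", (3, "pl"): "leggono",
}
ENGLISH_READ: Dict[Cell, str] = {
    (1, "sg"): "read", (2, "sg"): "read", (3, "sg"): "reads",
    (1, "pl"): "read", (2, "pl"): "read", (3, "pl"): "read",
}


def compatible_subjects(form: str, paradigm: Dict[Cell, str]) -> Set[Cell]:
    """Return every person/number cell in which the verb form occurs."""
    return {cell for cell, entry in paradigm.items() if entry == form}


def identifies_subject(form: str, paradigm: Dict[Cell, str]) -> bool:
    """Illustrative criterion: the form fits exactly one cell."""
    return len(compatible_subjects(form, paradigm)) == 1


print(compatible_subjects("leggo", ITALIAN_LEGGERE))
print(sorted(compatible_subjects("read", ENGLISH_READ)))
```

The sketch covers only the identification of person and number. It says nothing about whether a language licenses null subjects at all, which is the separate condition discussed for German above.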

2.1.2 Zero topic theory

The zero topic theory proposed by [Huang, 84] is based on the language classification of [Tsao, 77]. [Tsao, 77] proposed that languages like Chinese may be distinguished from languages like English by a parameter called discourse-oriented vs. sentence-oriented. He observed many properties by which languages can be grouped into discourse-oriented and sentence-oriented. Among these is the property of Topic NP deletion, which is only observed in languages characterized as discourse-oriented. They allow for deletion of the topic of a sentence under identity with the topic in the preceding sentence. The ability of a language to map an empty topic to an appropriate preceding topic is called the topic chain interpretation rule. The grammars of sentence-oriented languages lack this topic interpretation rule. Their sentences must have a subject. This also accounts for the presence of the expletive in such languages.

[Huang, 84] assumes that languages like Chinese allow binding of empty categories (which arise when some syntactic elements like subject and object are omitted) with a zero topic. Assuming that a topic can be deleted only if it refers to a preceding topic, we can now recover the content of the missing element.

Languages like Italian or Spanish do not have zero topics, which could explain why they are not able to omit the object pronoun. To recover the content of an omitted element, we refer here again to the morphology of the language. An empty subject pronoun can be recovered by examining verb inflection, but this is not possible for object pronouns.

The theory of [Huang, 84] is thus based on several factors which consider several properties of a language (zero topics, morphological richness) and some principles and conditions formulated in the government and binding theory of Chomsky (for more details, see [Huang, 84]).
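As a rough illustration of the topic chain interpretation rule, the following sketch maps each empty topic to the nearest preceding overt topic. It is a strong simplification under stated assumptions: each sentence is reduced to its topic (or None when the topic is deleted), and the function name and the placeholder referents are hypothetical rather than taken from [Tsao, 77] or [Huang, 84].

```python
from typing import List, Optional


def resolve_zero_topics(topics: List[Optional[str]]) -> List[Optional[str]]:
    """Fill each empty topic (None) with the nearest preceding topic."""
    resolved: List[Optional[str]] = []
    current: Optional[str] = None
    for topic in topics:
        if topic is not None:
            current = topic       # an overt topic starts a new chain
        resolved.append(current)  # an empty topic inherits the chain's topic
    return resolved


# Hypothetical discourse: one entry per sentence, None marks a deleted topic.
print(resolve_zero_topics(["Zhangsan", None, None, "Lisi", None]))
```

An empty topic with no preceding overt topic stays unresolved in this sketch, which matches the requirement that a topic can be deleted only if it refers to a preceding topic.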

2.2 Null subjects and syntax

In the following sections, various syntactic constructions in English and Italian are shown in which the subject pronouns can be omitted.

2.2.1 Null subjects and English syntax

Although English does not belong to the group of NSLs, there are indeed some constructions, like infinitival subclauses and imperatives, in which the subject is absent.

(9) a. Speak! (Imperative)

b. I would like [to come]_XCOMP.

c. I must [read this]_XCOMP.

d. John preferred [seeing Mary]_GER.

In English, an empty pronoun may occur only as the subject of an imperative, an infinitival clause or a gerund, but nowhere else. It cannot occur at all as the subject of a tensed clause or as an object [Huang, 84]. However, the subjects in (9b-d) have somewhat different properties from null subjects as in (10), in so far as the subject of an infinitive must be coreferential with the given subject of the main clause (subject control).

(10) a. Joe_i eats a banana and ∅_i watches TV.

b. You_i should wash the dishes or ∅_i vacuum the apartment.

c. * Joe_i eats a banana while ∅_i watches TV.

d. * You_i should wash the dishes although ∅_i vacuumed the apartment.

The example sentences in (10) show, though, that some finite subclauses, i.e. coordinated sentences, do not need a subject. In (10a), the subject of the clause watches TV does not exist locally, but this kind of construction allows the identification of the subject of a coordinated sentence with the subject of the main sentence, namely Joe. In contrast, subordinating conjunctions do not provide this kind of subject sharing. The clauses introduced by a subordinating conjunction require the subject to be overtly expressed (cf. (10c) and (10d)).


Yet, examples of subject omission in English finite clauses can be found in some nonstandard language constructions.

(11) a. - ∅_SUBJ cried yesterday morning.

b. She_i is Alsatian. ∅_i,SUBJ Seems intelligent.

[Haegeman, 00] found that English allows null subjects in some special discourse environments like short diary entries or notes (cf. sentences in (11)).

In this work, I will not deal with this kind of null subject in English. Nevertheless, it is important to discuss these constructions to show that there is a gradation rather than a hard boundary between NSLs and non-NSLs.

2.2.2 Null subjects and Italian syntax

Italian counterparts to the English sentences in (9) are shown in (12).

(12) a. Parla/Parlate!   (Imperative)
        speak!
        'Speak!'

     b. Vorrei   [venire]_XCOMP.
        I would  come.
        'I would like to come.'

     c. Devo    [leggere questo]_XCOMP.
        I must  read this.
        'I must read this.'

     d. John  preferisce  [di veder Mary]_GER.
        John  prefers     to seeing Mary.
        'John prefers to see Mary.'

The examples in (9) and (12) show that there are some syntactically isomorphic constructions in English and Italian which exhibit the same characteristics regarding the occurrence of the subject pronoun. But Italian has more constructions in which the subject pronoun can be omitted.

Finite clauses

(13) È   stanca.
     is  tired
     'She is tired.'

(14) Ti   hanno  imbrogliato.
     you  have   cheated
     'They cheated you.'


Example (13) shows a typical use of the null subject pronoun. The verb è gives information about the missing subject: it can only be the 3rd person singular. The predicative adjective stanca reveals another important characteristic of the null subject. Its ending can only match a feminine subject. Now we can derive the correct form of the subject pronoun although it is not overtly expressed: ella (= she). It is important to notice that the information about the gender of the missing subject is not always available in the sentence (cf. example (3b)). Thus, in some cases, the gender can only be derived if more context of the sentence is available.

One interesting fact about the use of subject pronouns in finite subclauses is shown in an example sentence of modern Italian in [Vanelli, Renzi, et al., 06], here example (15).

(15) Il   professore_i  ha   parlato  dopo   lui_*i  è arrivato.
     the  professor     has  spoken   after  he      arrived
     'The professor spoke after he arrived.'

[Vanelli, Renzi, et al., 06] claim that it is not possible to unify the subject pronoun in the subclause with the subject of the main clause. [Roberts, 07] notes, though, that this interpretation is "rather unusual than impossible" (footnote number 2, page 40). If the pronoun is stressed (cf. example (16a)), modified (cf. example (16b)) or coordinated (cf. example (16c)), the reference is possible in the subordinate clause [Cardinaletti & Repetti, 03]:

(16) a. Mario_i  ha   detto  che   LUI_i  verrà      domani.
        Mario    has  said   that  HE     will-come  tomorrow
        'Mario has said that HE will come tomorrow.'

     b. Mario  ha   detto  che   solo  lui  verrà      domani.
        Mario  has  said   that  only  he   will-come  tomorrow
        'Mario has said that only he will come tomorrow.'

     c. Mario  ha   detto  che   lui  e    sua  madre   verranno   domani.
        Mario  has  said   that  he   and  his  mother  will-come  tomorrow
        'Mario has said that he and his mother will come tomorrow.'

Constructions like the one in (14) can also be used as impersonal constructions. The agreement of the auxiliary hanno uniquely identifies the subject as the 3rd person plural, but this is not necessarily some specific group of referents. Such sentences emphasize the described fact, whereas the subject is irrelevant (or simply not known).
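The reasoning applied to examples (13) and (14) in this subsection, where verb agreement supplies person and number and a predicative adjective may supply gender, can be sketched as follows. The tiny lexicon, the ending-based gender heuristic and the function names are hypothetical simplifications; in particular, the sketch ignores that a grammatically feminine or masculine referent may be inanimate.

```python
from typing import Dict, Optional, Tuple

# Hypothetical hand-written lexicon: verb form -> (person, number).
VERB_AGREEMENT: Dict[str, Tuple[int, str]] = {
    "è": (3, "sg"), "dorme": (3, "sg"), "hanno": (3, "pl"), "leggo": (1, "sg"),
}

# English pronouns for all cells except 3rd person singular.
PRONOUNS: Dict[Tuple[int, str], str] = {
    (1, "sg"): "I", (2, "sg"): "you",
    (1, "pl"): "we", (2, "pl"): "you", (3, "pl"): "they",
}


def adjective_gender(adjective: Optional[str]) -> Optional[str]:
    """Rough heuristic for singular adjectives: -a feminine, -o masculine."""
    if adjective is None:
        return None
    if adjective.endswith("a"):
        return "f"
    if adjective.endswith("o"):
        return "m"
    return None  # adjectives in -e do not mark gender


def english_subject(verb: str, adjective: Optional[str] = None) -> str:
    """Propose an English subject pronoun for an Italian null subject."""
    cell = VERB_AGREEMENT[verb]
    if cell != (3, "sg"):
        return PRONOUNS[cell]
    # 3rd person singular: gender is known only if an adjective reveals it.
    return {"f": "she", "m": "he", None: "he/she/it"}[adjective_gender(adjective)]


print(english_subject("è", "stanca"))   # example (13)
print(english_subject("dorme"))         # example (3b)
print(english_subject("hanno"))         # example (14)
```

When no gender cue is present, as in (3b), the sketch returns all three candidates, which reflects the observation that more context is needed in such cases.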

Impersonal expressions

a. With impersonal verbs

(17) Piove.
     rains
     'It rains.'

b. Impersonal passive

(18) È   stato detto  che   viene.
     is  said         that  comes
     'It was said that he/she comes.'

c. Si impersonale

(19) In  Italia  si parla  italiano.
     In  Italy   speaks    Italian
     'In Italy one/people speak(s) Italian.'

Impersonal verbs (sometimes also called weather verbs) do not take any subject at all. The subject in the English translation of (17) is not a true subject. It occurs because subjects are obligatory, but it does not have a thematic role. Such impersonal subject pronouns are also called expletive it. The example in (18) is an Italian construction in which a subject pronoun does not occur. In the English translation of the sentence, we again have an expletive as the subject, as in the previous example.

Another way to express something impersonal in Italian is to use si impersonale. The reflexive pronoun si in (19), which could be seen as the subject of the given sentence, allows for expressing a given fact without specifying the subject.

2.3 Null subjects and pragmatics

The optionality of subject (and object) pronouns raises the question of why one should use them at all. When they occur as subjects, do they fulfill some specific function? If this is not the case, it could be assumed that subject pronouns in Italian can generally be dropped and are simply never used. In the literature, it is often said that optional pronouns are used when they are stressed. This explains why expletives, the subjects of so-called weather verbs, are not possible in Italian: since they do not contribute to the interpretation of the sentence, they would never be stressed, and they will therefore never be overt [Haegeman, 96].

Beyond this explanation for overt subject pronouns, there are some other functions that overt subject pronouns fulfill. [Duranti, 84] observed the use of subject pronouns in spoken Italian and specified these functions. Pronouns, nouns and, generally, all defining phrases are used to draw attention to some specific referent. [Duranti, 84] suggests that Italian subject pronouns are devices through which speakers define main characters in a narrative and/or convey empathy or positive affect toward certain referents. We start with an example of the common use of zero pronouns.

(20) Mio  padre   è andato  a casa.  Vuole  cucinare.
     my   father  went      home.    wants  cook.

'My father went home. He wants to cook.'


A null subject (or zero anaphora) is typically used for talking about some referent that has been mentioned in the immediate prior context (usually one or two clauses back). After introducing the referent (in example (20), mio padre), the omitted subject personal pronoun is used to make additional statements about the introduced referent.

[Duranti, 84] determined that in some situations the subject pronoun should be used. In these cases, it has to have some special function. He identified these functions by observing and analysing sketches of conversations of Italian native speakers.

1. Introducing and keeping track of referents in discourse
If a referent is not part of the recent context, it can be brought back into the context by using the pronoun that refers to it. In this case, the pronoun can be seen as an attention-getting device: It draws the addressee's attention to a particular referent.

Sometimes, subject pronouns are used although their referents have been mentioned in the immediate context. In these cases, there is some discontinuity in the temporal or spatial dimension of the discourse. For example, the pronoun is used to reintroduce an already mentioned referent, but in the context of some new specific event.

2. 'Main' characters and 'minor' characters
There is some difference in using pronouns for referents who are important in a story (main characters) and for those who are not (minor characters). The more important the character, the more often he/she is referred to by means of a personal pronoun. On the other hand, NPs or demonstratives are used to refer to minor characters.

3. Expressing empathy toward the referent
Besides personal pronouns, one can refer to someone in Italian by using demonstrative pronouns. Closer observation of the use of personal and demonstrative pronouns showed that demonstrative pronouns are used to express a certain emotional distance or negative affect toward the referent, whereas personal pronouns are used to express empathy with the referent.

[Duranti, 84] also points out that the prior mention of some referent is not a necessary condition for using a subject pronoun that should refer to someone or something. For example, in some cases, the 3rd person subject pronoun is used without prior identification of any referent. It can be used for referents that can be implied by a previous identification set. Table 1 from [Duranti, 80] shows how often the referents are introduced before referring to them by a null subject pronoun, by a pronoun, and by a noun. The length of context for introduction of the referent has been set to 2 preceding sentences. In 72,5% cases, the null subjects referents can be found in one of the two preceding clauses. In other cases, the referent is either not mentioned at all, or the distance between the referent and subject pronoun is greater than two clauses. Overt pronouns behave similarly to nouns. Their referents are rarely mentioned in the immediate context.


Referent of          introduced   not introduced
null subject (111)   72,1%        27,9%
pronoun (29)         34,5%        65,5%
noun (62)            27,4%        72,6%

Table 1: Statistics on referents of 3rd person subjects in Italian

2.4 Statistics on null subjects in Italian

In the previous chapter we have seen that the subject pronoun in Italian is rarely used. To get an idea of how often the subject pronoun is omitted, I examined 45 randomly selected sentences (93 main and subordinate clauses) from Europarl (cf. chapter 5.2). I identified sentence subjects and counted how often they are realised as zero pronouns, overt pronouns and nominal phrases (NPs). The results are presented in table 2.

SUBJ-NP    SUBJ-PRON   null-SUBJ
42 (45%)   7 (7%)      45 (48%)

Table 2: Occurrence of SUBJ in Italian
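
The percentages in table 2 can be recomputed from the raw counts. The following Python sketch assumes that each percentage is the share of a subject type among all counted subjects, i.e. that the counts are normalised by their own sum; the function and variable names are illustrative only.

```python
# Counts copied from table 2; the keys are the table's column labels.
subject_counts = {"SUBJ-NP": 42, "SUBJ-PRON": 7, "null-SUBJ": 45}


def subject_shares(counts):
    """Return each subject type's share of all counted subjects in whole percent."""
    total = sum(counts.values())  # assumption: normalise by the sum of the counts
    return {label: round(100 * n / total) for label, n in counts.items()}


shares = subject_shares(subject_counts)
```

Under this assumption, the rounded shares agree with the percentages given in the table.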

Nearly half of all clauses have zero subjects. The subject pronoun is used in only 7% of cases. I also examined the person and number of the omitted pronouns (cf. table 3).

Num/Pers   1    2    3    3P
Sg         24   ∅    8    4
Pl         4    3    2    -

Table 3: Occurrence of null-SUBJ in 93 observed clauses

The majority of the omitted subject pronouns are 1st person singular. This is not really surprising: The corpus that I worked with (cf. chapter 5.2) consists of parliamentary discussions in which a certain person expresses his or her opinion about something. The speakers speak for themselves, so most pronouns are 1st person singular. Sometimes, they also speak for some group of people to which they belong, e.g. a party. In these cases, the omitted subject refers to 1st person plural referents. We see that there are no pronouns for the 2nd person singular. This is also not surprising because in such meetings, people do not address each other informally.

Regarding the 3rd person singular, we have to single out the polite form in Italian (column 3P in table 3), which is expressed by the 3rd person singular when only one person is the addressee. The other cases of 3rd person singular pronouns refer either to someone or something already mentioned, or they correspond to English expletives.


Now, let us take a look at the clauses in which the subject pronoun has not been omitted. In a set of 95 examined clauses, I found 7 occurrences of overt subject pronouns, three of which are the polite form. Let us take a closer look at these sentences.

(21) Sì,   onorevole   Evans_i,  ...,  che   lei_i  propone  ...
     yes,  honourable  Evans,    ...,  that  you    suggest  ...

'Yes, honourable Evans, ... , you are suggesting ...'

(22) Onorevole   Lynne_i,  lei_i  ha    perfettamente  ragione  ...
     honourable  Lynne,    you    have  perfectly      right    ...

'Honourable Lynne, you are perfectly right ...'

(23) Onorevole   collega    Barón  Crespo_i,  lei_i  non  ha potuto  partecipare  ...
     honourable  colleague  Barón  Crespo,    you    not  could      participate  ...

'Honourable colleague Barón Crespo, you couldn't participate ...'

Examples (21) - (23) show that the referents of the 3rd person singular pronoun in subclauses are situated in the same sentence. This is rather unusual if we refer to the observations of [Duranti, 80]. I assume that the subject pronoun is used here to disambiguate the referent which can serve as a subject of the 3rd person singular verbs: the NP introduced in the main clause or a referent from the preceding context (sentences).

Another three occurrences of pronouns are in the 1st person singular or plural.

(24) Noi  tutti  siamo  lieti    ...
     we   all    are    pleased  ...

'We all are pleased ...'

(25) ...  che   proprio    noi  non  rispettiamo  ...
     ...  that  ourselves  we   not  adhere to    ...

'... that we ourselves do not adhere to ...'

(26) ...  l'   onorevole   Díez  González  e    io  avevamo  presentato  ...
     ...  the  honourable  Díez  González  and  I   have     presented   ...

'Honourable colleague Díez González and I have presented ...'

Examples (24) and (25) show that the pronouns are used to stress something, e.g. the subject of the sentence. It is notable that the pronouns occur with adverbs like tutti and proprio that in some way emphasize the subject. The subject of the sentence in (26) differs from the subjects we have observed so far. The Italian subject pronoun io is used as a part of the coordinated subject NP which also consists of the NP l' onorevole Díez González. As a part of a coordinated subject NP, the subject pronoun cannot be omitted.

Finally, there is one occurrence of the subject pronoun of the 3rd person singular:

(27) ...  che    esso  stesso  approva.
     ...  which  it    itself  adheres to.

'... which itself adheres to.'

The last example shows that the pronoun is also emphasized, in this case by the adjective stesso. Similar cases of emphasis have already been shown in examples (16b) and (16c).


2.5 Summary

Pro-drop is a linguistic phenomenon which can be found in many languages. Some languages allow for omitting both subject and object pronouns (pro-drop languages) whereas some languages like Italian permit only the subject pronoun to be omitted. Italian is therefore called a null subject language (NSL). On the other hand, we have observed that some languages like English must have overtly expressed (pronominal) subjects. English belongs to the group of non-null subject languages (non-NSL). Whereas English morphology is not rich enough to allow the recovery of the characteristics of the missing subjects, the Italian verb inflection enables the derivation of the linguistic characteristics (for example, number and person) of the omitted pronominal subject.

The analysis of subject pronouns in the given language pair showed that English as a non-NSL also has constructions in which the subject can be omitted (cf. examples (9) - (11)). However, these constructions are not relevant for this work, in which I deal solely with finite English sentences which do not allow for omitted subjects.

The analysis of Italian sentences revealed that the pronouns in Italian (according to the observed corpus) are omitted in most cases (cf. table 2). If they are overtly expressed, they are often emphasized by accompanying adjectives or adverbs (cf. examples in (24) and (25)). In specific contexts, the 3rd person pronoun lei is used to enable unambiguous identification of the NP that it refers to (cf. examples (22) and (23)).

The difference in the usage of subject pronouns in Italian and English (cf. example (7)) leads to problems in machine translation (MT). In the following chapter, the problem of pro-drop within MT is discussed. After previous work on pro-drop in MT is presented, different cases of problems regarding the translation of pronominal subjects in both translation directions IT → EN and EN → IT are shown.


3 Pro-drop in machine translation

In this chapter, subject pronoun omission within machine translation (MT) is discussed. Although this work concentrates on statistical machine translation, I discuss previous work regarding pro-drop in different MT systems. In section 3.2, a detailed analysis of pronominal subject translation between English and Italian is carried out. Example sentences containing pronominal subjects have been translated by the rule-based system Systran² and the statistical MT systems Google Translate³ and Moses⁴.

When Italian null subjects are translated into English, their properties like number, person and gender have to be derived in order to generate the correct English subject pronoun. For human translators, it is relatively easy to do this, since they are able to determine the person, animal or thing to which the omitted subject pronoun refers. These referents are not necessarily in the same sentence: They can occur in one of the preceding sentences. Problems occur when single Italian sentences containing a null pronoun should be translated. Even without context and access to world knowledge, it is possible to derive the right person and number of the omitted pronoun. But if, for example, it is known that the missing pronoun is 3rd person singular, but we do not know which gender the pronoun has, how can we decide whether we should translate the missing pronoun as the feminine pronoun she or as the masculine pronoun he?

When the translation task is in the other direction, a decision must be made whether the Italian pronominal subject should be expressed overtly or be dropped. Furthermore, the gender discrepancy between English and Italian can lead to the generation of incorrect Italian pronouns (for example, 3rd person pronouns).

Machine translation is confronted with the same problems when translating between a non-NSL English and a NSL Italian. Most MT systems operate on single-sentence input and do not use the context of previous sentences. When translating into English, the correct pronoun for a null subject in Italian has to be found. But often, the context (previous sentences) of an observed sentence should be taken into account to resolve the missing pronoun. When translating into Italian, it has to be determined if the subject pronoun should be generated or omitted.

We summarize the questions that have to be answered:

(Q1) Automatic word alignment
How to align the existing subject pronoun in non-NSL (English) with an omitted subject in NSL (Italian)?

(Q2) Translation: NSL → non-NSL
How can we automatically generate the right subject pronoun in the target language for the missing subject pronoun in the source language?

(Q3) Translation: non-NSL → NSL
When should the non-NSL subject pronoun be omitted in the NSL target language? The answer to this question is important if we want the automatically generated translations to sound natural.

² http://www.systranet.com/
³ http://www.google.com/language_tools
⁴ I built a baseline SMT system with Moses (cf. chapter 6.2).

3.1 Previous work on zero pronouns in MT

The problems regarding automatic translation of null subjects from a NSL to some non-NSL, and vice versa, have been dealt with only indirectly.

[Goldwater & McClosky, 05] dealt with the statistical machine translation of the language pair Czech (NSL) and English. The aim of their work was to find out if the translation from Czech, a morphologically rich language, to English, which is a language with weak morphological inflection, can be improved if morphological information is available. Their idea was to use morphological analysis on Czech. The Czech input has been lemmatized and pseudowords have been inserted in order to eliminate some morphological differences between the two languages and to deal with the sparse data problem. These pseudowords are morphological tags that express some specific properties. [Goldwater & McClosky, 05] inserted pseudowords with information about the verb person (among others) into the Czech input. The pseudowords should simulate the existence of pronouns for the English pronouns to align with. [Goldwater & McClosky, 05] reported that person pseudowords indeed have been aligned to English pronouns with high probability. However, it has not been reported if these pseudowords solve all problems regarding null subjects. The question is how often the null subjects are correctly translated. Erroneous translations are possible when ambiguous verbs should be translated, or when the referents of the omitted subject pronouns have a different grammatical gender. For the opposite translation direction this approach could be somewhat problematic: If English pronouns are in most cases aligned to Czech pseudowords (with surface form ∅), this translation alternative receives high likelihood. Are then (nearly) all English subject pronouns translated as null subjects in Czech?

Another work on translation between a NSL (Spanish) and a non-NSL (English) has been done by [Peral & Ferrández, 03]. They developed a system which identifies and resolves all pronouns (not only the omitted subject pronouns) in Spanish as a source language. Their translation system is based on an interlingua approach. The input text undergoes several analysis steps: morphological analysis, POS-tagging, parsing and word-sense disambiguation. The enriched input text serves as input to a component which deals with different NLP problems like anaphora identification and resolution. After dealing with anaphora, the generation of the interlingua representation of the whole input text is carried out. This representation contains all information needed to translate pronouns into the target language. Although the authors report very good results in the tasks of anaphora identification and generation, there are some additional problems that their MT system had to solve. For example, if it is clear that the omitted subject pronoun in Spanish as source language is 3rd person feminine, this does not automatically mean that the correct English pronoun should also be of the same gender (e.g. el_masc with the referent el perro_masc vs. it_neut with the referent dog_neut). In English, animals have neuter grammatical gender. So, we need the information that the referent of el is an animal in order to correctly translate the pronoun (possibly an omitted pronoun) into English. Evaluating their system, [Peral & Ferrández, 03] translated all occurrences of English (as source language) pronouns into their Spanish equivalents. They note, though, that a subsequent task must decide if the pronoun in Spanish must be generated, substituted by some other pronoun or eliminated.

[Nakaiwa & Ikehara, 92] developed an anaphora resolution system for Japanese (a pro-drop language) and integrated it into a machine translation system for Japanese to English called ALT-J/E. The anaphora resolution process is based on semantic attributes of verbs and their relationship to the arguments. For each verb it is necessary to determine its semantic category and its relationship to its arguments (SUBJ, OBJ). These arguments can be anaphora and nominal phrases. Using this information, rules allow the derivation of the correct referent for a particular anaphor, which can be a zero pronoun. For example, let us assume that we want to resolve an anaphor a_i governed by some verb v_i with some semantic attribute vsa_i. a_i is a subject of v_i. We have the same information about some verb v_j of a so-called topicalized unit sentence⁵ which governs some phrase which could be a referent of a_i. Given this information, the rules are searched in order to find the right referent for a_i. The rules have the following form: If v_i has a verb category vsa_i and governs an anaphor a_i as its argument arg_i (e.g. SUBJ) and we have some verb v_j with verb category vsa_j, then the argument arg_j (e.g. OBJ) of verb v_j can be assumed to be a referent of a_i. To apply these rules, the verb in the sentence with the zero pronoun and the verb of the unit sentence have to be extracted. Their verb categories are identified. According to the rules describing verb relationships as sketched above and the identified verb categories, the referent of the zero pronoun is established.

When translating the resolved zero anaphora (i.e. their referents), it could happen that the translation in English becomes verbose. In this case, elliptical pronouns and definite articles should be used [Nakaiwa & Ikehara, 92]. This leads again to the problem of generating the correct English subject (and object) pronoun.
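
The rule form described for [Nakaiwa & Ikehara, 92] can be pictured as a table lookup. The following Python sketch is only an illustration of that form: the verb category names, the two rules and the example arguments are invented here and are not taken from the ALT-J/E system.

```python
# Hypothetical rule table. The category names are invented for illustration.
# Key:   (vsa_i, arg_i, vsa_j)
# Value: arg_j, the argument of v_j that is assumed to be the referent of a_i
RULES = {
    ("MOTION", "SUBJ", "POSSESSION"): "SUBJ",
    ("MOTION", "SUBJ", "TRANSFER"): "OBJ",
}


def resolve_zero_pronoun(vsa_i, arg_i, vsa_j, unit_arguments):
    """Return the referent of the anaphor a_i, or None if no rule matches.

    unit_arguments maps the argument roles of v_j (e.g. "SUBJ", "OBJ")
    to the phrases that fill them in the topicalized unit sentence.
    """
    arg_j = RULES.get((vsa_i, arg_i, vsa_j))
    if arg_j is None:
        return None
    return unit_arguments.get(arg_j)


# Invented example: the zero subject of a MOTION verb is resolved to the
# object of a preceding TRANSFER verb.
referent = resolve_zero_pronoun(
    "MOTION", "SUBJ", "TRANSFER", {"SUBJ": "the teacher", "OBJ": "the book"}
)
```

A real system would additionally need the extraction of both verbs and the assignment of their categories, which the sketch takes as given.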

3.2 Translation between English and Italian

In this chapter, I will describe differences between English and Italian regarding the null subject that cause problems for automatic translation between the two languages. Some of the cases have already been mentioned in the preceding discussion. Now, we look at concrete examples and translations that three MT systems provided: S - the rule-based MT system SYSTRAN⁶, G - the statistical MT system Google translator⁷, and M - the statistical MT system Moses (cf. chapter 4). The translation under R represents the reference. Some of the source language sentences are extracted from Europarl (cf. section 5.2) whereas others were constructed by myself.

⁵ This is a sentence that contains nominal phrases which can serve as referents of the anaphora in the following sentences.

⁶ Free translation at http://www.systranet.com/ (November 2009).
⁷ Free translation at http://www.google.com/language_tools (November 2009).


Since the example analysis describes linguistic knowledge needed for resolving some problems regarding null subjects, it is important to point out that phrase-based statistical MT systems in their original form do not have access to any linguistic knowledge, so they are certainly disadvantaged when linguistic knowledge is needed to generate correct translations. Rule-based systems are more likely to recognise which pronominal subject can occur with a given verb form.

3.2.1 Italian to English

We already know that in Italian, the properties of the missing subject like number and person can be derived from the verb inflection (cf. section 2.1). We will now examine how well this works in available MT systems. The words set in bold in the Italian input sentences are finite verbs. The pronouns in bold in the English translations represent subjects corresponding to the omitted subject in Italian.

First person subject pronouns
Let us begin with the omitted pronouns of the first person singular and plural.

(28) "So che il governo americano condivide i nostri obiettivi."
     R: I know that the American government shares our goals.
     G: I know that the U.S. government shares our goals.
     S: I know that the government American shares our objectives.
     M: I know that the american government shares our objectives.

(29) "Hanno compreso, come noi, quanto sia importante che svolgiamo insieme ..."
     R: They understood, as we did, how important it is that we carry out together ...
     G: They understood, like us, it is important that we do together ...
     S: They have comprised, like we, how much is important that we carry out ...
     M: They understood, as we, how important it is that *∅ perform together ...

With respect to the subject pronouns, all translations but one are correct. In (29), Moses does not generate the subject pronoun of the verb perform. Verb forms for the 1st person singular and plural are not ambiguous, so the right pronoun in English can be derived from the analysis of the Italian verb form.⁸

The translation possibilities can be summarised as shown in (30).

(30) IT.Verb.1.P.Sg → I + EN.Verb.1.P.Sg
     IT.Verb.1.P.Pl → We + EN.Verb.1.P.Pl
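
Since the first person forms are unambiguous, (30) amounts to a simple lookup. The following Python sketch illustrates this; it assumes that person and number of the Italian finite verb have already been determined by some morphological analysis, and the feature encoding and function name are illustrative only.

```python
# Rule (30) as a lookup table: (person, number) of the Italian finite verb
# -> English subject pronoun. The feature encoding is an assumption.
FIRST_PERSON_PRONOUNS = {(1, "Sg"): "I", (1, "Pl"): "we"}


def first_person_subject(person, number):
    """Return the English pronoun for a 1st person null subject.

    Returns None if rule (30) does not apply to the given features.
    """
    return FIRST_PERSON_PRONOUNS.get((person, number))
```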

Second person subject pronouns
Let us go on with the second person singular and plural.

(31) "Hai detto che parli italiano."
     R: You said that you speak Italian.
     G: You said that you speak Italian.
     S: You have said that it speaks Italian.
     M: You have said that *∅ speaks Italian.

⁸ An explanation for false Moses output is given later in chapter 6.

(32) "Avete giocato con i genitori."
     R: You played with parents.
     G: You played with their parents.
     S: You had played with the parents.
     M: *You with their parents.

The VPs (auxiliary + participle) in the main clauses in example sentences (31) and (32) can be uniquely translated into English. In the Italian subclause in (31), we face an ambiguous verb, parli: It can occur with the 2nd person singular, as recognised by Google. But, as a subjunctive, parli can also occur with the 3rd person singular, as recognised by Systran. Moses does not generate any subject pronoun, leading to a grammatically incorrect subclause translation.

Beyond the ambiguity regarding some verbs in indicative and subjunctive, there is another problem regarding verbs in present tense. The indicative and imperative verb forms for the second person are the same.

(33) "Dite che parlate italiano."
     R: Say that you speak Italian.
     G: ∅ Say you speak Italian.
     S: You say that *∅ speeches Italian.
     M: You say that you are italian.

(34) "Dite se parlate italiano."
     R: Say if you speak Italian.
     G: ∅ Say if you speak Italian.
     S: You say if *∅ speeches Italian.
     M: You say if *∅ spoken italian.

(35) "Scrivi una lettera."
     R: You are writing a letter.
     G: *∅ Write a letter.
     S: You write a letter.
     M: *∅ Refer a letter.

(36) "Scrivi una lettera!"
     R: Write a letter!
     G: ∅ Write a letter!
     S: *You write a letter!
     M: ∅ Refer a letter!

The only difference between (33) and (34) is the conjunction used: che (= that) and se (= if). Whereas the conjunction che could be used both in an indicative and an imperative sentence, the conjunction se instead suggests the interpretation of the verb dite as imperative. So, the Google translations are both acceptable, but Systran's are not. Whether the subject of the main clause in (33) should be used (for the indicative reading) or not (for the imperative reading) cannot be derived directly. This would probably be easier if we had access to the context of the given sentence. If the sentence mode is marked by punctuation, it is possible to derive the right sentence mode (cf. examples (35) and (36)). Unfortunately, the MT systems do not seem to use this information for deciding whether the subject in English should be generated (for indicative) or not (for imperative).

Let us summarise the translation alternatives for the omitted subject for the 2nd person singular and plural.

(37) IT.Verb.2.P.Sg → You + EN.Verb.2.P.Sg (indicative)
     IT.Verb.2.P.Sg → ∅ + EN.Verb.2.P.Sg (imperative)
     IT.Verb.2.P.Pl → You + EN.Verb.2.P.Pl (indicative)
     IT.Verb.2.P.Pl → ∅ + EN.Verb.2.P.Pl (imperative)
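
The choice between the indicative and imperative alternatives in (37) could be approximated by looking at the sentence-final punctuation, as examples (35) and (36) suggest. The following Python sketch shows this heuristic; it assumes that the main verb has already been identified as 2nd person, and it is a simplification that cannot decide cases like (33) and (34), where the punctuation does not mark the sentence mode.

```python
def second_person_subject(sentence):
    """Choose the English subject for a 2nd person null subject, cf. rule (37).

    Heuristic (an assumption of this sketch): a sentence-final '!' marks
    the imperative reading, for which no subject is generated (None).
    Otherwise the indicative reading with the subject 'you' is chosen.
    """
    imperative = sentence.strip().endswith("!")
    return None if imperative else "you"
```

For "Scrivi una lettera!" the sketch generates no subject, whereas for "Scrivi una lettera." it generates you.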

Third person subject pronouns
The most complicated case is that of the 3rd person pronouns that have been omitted. We will start with the cases in singular.

(38) "Dice che parla italiano."
     R: He/She says that he/she speaks Italian.
     G: She says she speaks Italian.
     S: It says that *it speaks Italian.
     M: *∅ Says that *∅ speaks Italian.

(39) "Pensa che non è malata."
     R: He/She thinks that she is not ill.
     G: *∅ Think that is *∅ not sick.
     S: *It thinks that *it is not sick.
     M: *∅ Does that *∅ not is sick .

Examples (38) and (39) already show the limitations of the tested systems regarding the null subject. Indeed, in the first example, it is not possible to derive the gender of the missing subject pronoun. Google proposes the pronoun for the 3rd person feminine as subject for both clauses in the source sentence. Since we do not know anything about the context of the sentence, we can accept this solution.⁹ The translation that Systran suggested has at least one error. The proposed subject for the main clause can be seen as correct if the subject refers, for example, to some book or note or the like. Knowing, though, that only humans can speak, the subject pronoun it for the subclause cannot be correct. The Moses translation does not contain subject pronouns and is therefore grammatically incorrect.

In contrast to example (38), at least the subclause in (39) provides all information needed to generate the right subject pronoun in English. Predicative adjectives which occur with copula verbs agree in number and gender with the referents that they modify. So, it is possible to determine the subject of the subclause in (39) as feminine singular. The verb provides the information that the subject is in the 3rd person, so we can clearly say that the subject in the English translation should be she. Concerning the subject of the main clause, the translation should be at least he or she if we assume that only humans have the ability to think.

⁹ I have been told by a native speaker of Italian that masculine is used when a decision about the gender cannot be made.

The property of Italian described for the subclause in (39) also holds for compound tense forms which take essere (= be) as an auxiliary.

(40) "È andata a casa."
     R: She went home.
     G: *∅ Went home.
     S: *It has gone to house.
     M: *∅ Has gone home.

(41) "Era rimasto a scuola."
     R: He stayed at school.
     G: He had stayed in school.
     S: *Era remained to school.
     M: *∅ Remained at school.

The participle andata in (40) provides information about the gender of the omitted subject pronoun. Together with the information which the inflected verb È provides, it is possible to identify the subject as 3rd person singular feminine: she. The same form of analysis for the verb Era and the participle rimasto leads us to the conclusion that the subject in English in (41) should be he.

The 3rd person singular is additionally used in the polite form of address. It is used with the Italian 3rd person pronoun lei, which is unfortunately also the pronoun for the 3rd person singular feminine. So, this is another case of ambiguity to deal with.

(42) "Lei non è stata a casa?"
     R: Was she not at home?
     G: She was not at home?
     S: Hasn't *it been to house?
     M: *You was not at home?

Google translator recognises the subject pronoun Lei as 3rd person singular feminine, which is one interpretation of this pronoun. The other translation possibility, namely you, is found by Moses, but the generated pronoun does not agree with the corresponding verb was.

The next examples show impersonal constructions in Italian. We begin with an example of a so-called weather verb.

(43) "Piove."
     G: *∅ Rains.
     S: It rains.
     M: *∅ Rain.


Weather verbs as in (43) need expletives in English. Only Systran generates the correct subject pronoun for the example sentence in (43).

Let us now examine the si sentences and their English equivalents. The first three examples contain intransitive verbs. These constructions are called si impersonale.

(44) "In Germania si beve la birra."
     R: In Germany, people drink beer.
     G: * In Germany, ∅ drinking beer.
     S: In Germany the beer is drunk.
     M: In Germany we drink beer.

(45) "In Germania si è letto molto."
     R: In Germany, people have read a lot.
     G: In Germany *you have read a lot.
     S: In Germany a lot has been read.
     M: Germany has read.

(46) "Quando eravamo studenti, si è andati a scuola."
     R: When we were students, we went to school.
     G: When we were students, *he went to school.
     S: When we were students, it has been gone to school.
     M: When we were studenti, *∅ has gone to school.

Examples (44) - (46) show the use of si impersonale. The subjects in the English translations of (44) and (45) should be people or one. The translations of the subclause in (46) (Quando eravamo studenti) are correct, but the translations of the main clause are a bit problematic. The main clause consists of the finite verb for the 3rd person singular and the participle andati that matches a subject in plural. MT systems use only the information about the finite verb and generate the corresponding pronouns in the target language, though with different values for gender.

But if the VP è andati refers to the same set of referents as in the subclause, the pronoun we should be used as the subject of the main clause. This is not trivial since we are dealing with the verb è, which needs a subject of the 3rd person singular, but we want to generate a pronoun of the 1st person plural in the target language.

Until now, we have taken a look only at cases of the 3rd person singular. Examples for the 3rd person plural follow in (47) and (48).

(47) "Hanno cantato la mia canzone."
     R: They sang my song.
     G: They sang my song.
     S: They have sung my song.
     M: My song have been sung.

(48) "Sono state in Croazia."
     R: They were in Croatia.
     G: *∅ Were in Croatia.
     S: They have been in the Croatia.
     M: *∅ Were in Croatia.


The only alternative for translating the 3rd person plural into English is they. All information (3rd person plural feminine) can be derived for the subject in the example sentence (48). Since there are no gender distinctions for the 3rd person plural in English, this translation case is unambiguous and should be they.

Let us now summarise the observations made by examining examples (38) - (48).

(49) IT.Copula.3.P.Sg + IT.PastPart.F → She + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.PastPart.F → You + EN.Verb.2.P.Sg (polite)
     IT.Copula.3.P.Sg + IT.Predicative.F → She + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.Predicative.F → You + EN.Verb (polite)
     IT.Copula.3.P.Sg + IT.PastPart.M → He + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.Predicative.M → He + EN.Verb.3.P.Sg
     IT.Verb.3.P.Sg → He/She + EN.Verb.3.P.Sg (if only human referents possible)
     IT.Verb.3.P.Sg → It + EN.Verb (if human referents not possible)
     IT.si + IT.Verb.3.P.Sg → one/people + EN.Verb.3.P.Sg/Pl
     IT.Impers.3.P.Sg → It + EN.Verb.3.P.Sg
     IT.Verb.3.P.Pl → They + EN.Verb.3.P.Pl
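
The alternatives in (49) can be read as a decision procedure over the features that the Italian VP provides. The following Python sketch is one possible reading; the order in which the conditions are checked, the parameter names and the feature encoding are assumptions of this sketch, and the polite reading has to be supplied from outside because it cannot be derived from the verb form.

```python
def third_person_subject(number, gender=None, human=None,
                         impersonal=False, si=False, polite=False):
    """Choose an English subject for a 3rd person null subject, cf. (49).

    number:     "Sg" or "Pl", taken from the finite verb
    gender:     "F" or "M" if a participle or predicative adjective marks it
    human:      True/False if it is known whether the referent must be human
    impersonal: True for weather verbs
    si:         True for si constructions
    polite:     True if the polite form of address is intended
    """
    if number == "Pl":
        return "they"
    if impersonal:
        return "it"
    if si:
        return "one/people"
    if polite:
        return "you"
    if gender == "F":
        return "she"
    if gender == "M":
        return "he"
    if human is True:
        return "he/she"
    if human is False:
        return "it"
    return None  # not decidable from the sentence alone
```

For È andata in (40), the call third_person_subject("Sg", gender="F") yields she; without any gender or referent information, the sketch returns None, which corresponds to the cases that need context.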

There is another interesting construction in Italian which does not contain a subject, namely the negated imperative for the 2nd person singular.

(50) "Non mangiare nelle ore di lezione!"
     R: Do not eat in the hours of lessons!
     G: ∅ Do not eat in the hours of lessons!
     S: ∅ Not to eat in the hours of lesson!
     M: ∅ Do not eat in hours of lesson!

The negated imperative form for the 2nd person singular consists of the negation non and the infinitive, in our case mangiare. This kind of sentence should be translated by a do not ... construction, as Google translator suggested. Though Systran's translation does not have a subject, which is correct, it also contains the infinitive marker to, which makes the sentence grammatically incorrect. The analysis of the example in (50) leads to the following rule:

(51) IT.non + IT.infin → ∅ + do not EN.infin
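
Rule (51) can be applied to part of speech tagged Italian input. The following Python sketch checks whether a sentence starts with non followed by an infinitive; the tag names NEG, VINF and VFIN are hypothetical and stand for whatever labels the tagger in use assigns, and the restriction to the sentence-initial position is a simplifying assumption.

```python
INFINITIVE_TAG = "VINF"  # hypothetical tag name for infinitives


def is_negated_imperative(tagged_sentence):
    """Check for the pattern 'non' + infinitive at the start of a sentence.

    tagged_sentence is a list of (word, tag) pairs.
    """
    if len(tagged_sentence) < 2:
        return False
    (first_word, _), (_, second_tag) = tagged_sentence[0], tagged_sentence[1]
    return first_word.lower() == "non" and second_tag == INFINITIVE_TAG


def english_prefix(tagged_sentence):
    """Return 'do not' (without a subject) if rule (51) applies, else None."""
    return "do not" if is_negated_imperative(tagged_sentence) else None


# The beginning of example (50) with hypothetical tags.
example = [("Non", "NEG"), ("mangiare", "VINF"), ("nelle", "PREP"), ("ore", "NOUN")]
prefix = english_prefix(example)
```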

3.2.2 English to Italian

As already mentioned at the beginning of the chapter, the main question in translation direction EN → IT is whether the Italian subject pronouns should be generated or omitted. In principle, they could always be generated or always omitted. Neither of these decisions is ideal: Whereas the omission of all subject pronouns can lead to problems with respect to the adequacy of the translations, the generation of all subject pronouns would very likely result in a text that sounds rather unnatural. A text consisting of a sequence of sentences in which almost every sentence has a subject pronoun contains a lot of redundant information (number, person, gender) coded at the same time both in the subject pronouns and in the finite verbs. So, the subject pronouns should be omitted to avoid the redundancy and to preserve the text fluency.

If just one isolated sentence should be translated, it is quite conceivable that such a sentence contains a subject pronoun. The explicit occurrence of the subject pronoun in such isolated sentences can be explained by the fact that without the context, it is not possible to determine the referent which the omitted subject pronoun refers to. In such a context, the use of a subject pronoun can thus be compared with the use of a NP subject. It introduces a referent and provides information about it. In isolated sentences, this information can only be provided by the referent that is situated in the given sentence.

Since translation is more often carried out on a text, it should be examined in which contexts the subject pronoun should be dropped or realized overtly. In our discussion so far, we saw that the use of a subject pronoun often has pragmatic reasons (cf. section 2.3) which are not easy to capture in an automatic translation system.

Some cases in which the pronoun is overtly used have already been shown and discussed in section 2.2.2. A much more detailed examination is needed to find the contexts in which subject pronouns in Italian are used. The pronoun triggers shown in (16b), (24), (25), (27) have to be identified and it should be investigated how probable it is that they really occur with the subject pronoun.

This kind of rather local regularity can be captured by SMT systems. They work on the word level and can identify word sequences which are often translated to each other. So, if itself corresponds relatively often to the phrase esso stesso, it has a good chance of being translated as this phrase without using heuristics to decide whether the pronoun should be generated.¹⁰

3.3 Summary

In MT, the problem of pro-drop has been dealt with only marginally. In my opinion, however, this is an important issue since the absence of the subject (pronoun) in a non-NSL leads to grammatically incorrect sentences. If the subject is not generated because the corresponding element in the source language does not exist, it should be examined which information in the source language could be used to generate the correct subject pronoun. The analysis of source sentences of an NSL, Italian (cf. section 3.2.1), showed that in many cases Italian verbs carry enough information to enable the generation of the correct English pronoun. However, in a number of cases, Italian verbs are ambiguous, so that the context (preceding sentences) has to be considered in order to derive the correct English subject pronoun.

When an Italian text is to be generated from an English input, it has to be determined whether the subject pronouns should be absent or not. Since their use has pragmatic reasons, a more detailed analysis of Italian is needed to answer this question.

In the following chapter, the details of statistical machine translation are sketched. In chapter 5, a method for the word alignment of Italian and English VPs is described. SMT systems are built to test whether the rule-based VP alignment contributes to a better translation of pronominal subjects between Italian and English. The evaluation results of the systems are shown in chapter 6.

¹⁰ Details on phrase-based SMT are discussed in chapter 4.

4 Statistical machine translation

This chapter describes phrase-based statistical machine translation (SMT). In section 4.1, the statistical models for automatic word alignment are introduced. We take a closer look at GIZA++, the open source word alignment tool developed by [Och & Ney, 03], since this tool was used to create a baseline word alignment which has been improved by applying the alignment rules described in chapter 5. In section 4.2, the concept of phrase-based SMT is described. The phrase-based SMT approach is implemented in the open source SMT system Moses [Koehn et al., 07], which has been used within this work.

4.1 Word alignment

Word alignment is a very important task within SMT. In the training process of an SMT system, it is necessary to identify word equivalences in order to obtain the translation tables which are needed in the translation process. Phrase-based SMT systems (cf. section 4.2) use the word alignment to extract translation phrases (word sequences). So, the quality of the word alignment is crucial for extracting good parallel phrases.

There are five statistical models, the so-called IBM Models, which are used to automatically compute the word alignment of a parallel sentence-aligned corpus [Brown et al., 03]. Word alignment models are trained by the Expectation Maximization (EM) algorithm. EM consists of two steps: (i) expectation, in which the alignment model is applied to the data, and (ii) maximization, in which the model parameters are recalculated. The simplest way to start the EM training is to assume that all words are equally likely to be aligned to each other. The model is applied to the data, resulting in a word-aligned parallel corpus. On the basis of the counts of the alignment pairs, the lexical translation probabilities are re-estimated. These recalculated model parameters are used as the model for the next iteration. The algorithm stops when convergence is reached.

In the first statistical word alignment model, IBM Model 1, the sentences are treated as bags of words, which means that the word order does not play any role in the word alignment process. The improvement of this model leads to Model 2, in which the target word also depends on its position in the TL sentence. Since some words can be aligned to a sequence of words in the other language, it is desirable to model and allow 1-to-n alignments. This is done by modeling word fertility in Model 3. In Model 4, the position of the previously translated word is taken into account. In the following, the IBM models for word alignment are briefly described.¹¹

IBM Model 1
When computing the word alignment of a sentence pair, we are interested in the most probable alignment $a$ for a sentence pair containing the target language (TL) sentence $e = (e_1, ..., e_{l_e})$ and the source language (SL) sentence $f = (f_1, ..., f_{l_f})$. Formally, we need to compute the alignment probability $p(a \mid e, f)$ (cf. equation (1)).

¹¹ For a more detailed discussion of the methods in statistical machine translation, please refer to [Koehn, 09].

\[
p(a \mid e, f) = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)} \tag{1}
\]

Equation (1) uses the lexical translation probabilities $t(e_j \mid f_i)$, which express the probability of generating the TL word $e_j$ from the SL word $f_i$. The numerator $t(e_j \mid f_{a(j)})$ models the probability of generating the word $e_j$ from the word $f_i$ selected by the alignment function $a(j) = i$.

After the most probable alignment of a sentence pair is computed using equation (1), the model parameters are re-estimated. The weighted counts $c(e \mid f; \mathbf{e}, \mathbf{f})$ for translating a particular SL word $f$ into a particular TL word $e$ in the sentence pair $(\mathbf{e}, \mathbf{f})$ are collected. From these counts, the new translation probability $t(e \mid f)$ can be estimated. The uniform probability distribution is taken as the initial lexical probability distribution, indicating that every TL word is equally likely to be generated from each SL word.
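The EM procedure for Model 1 described above can be sketched in a few lines. This is a minimal illustration, not the GIZA++ implementation: the function names, the toy corpus and the fixed number of iterations (instead of a convergence test) are assumptions made for the example.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (target_words, source_words) pairs. Returns t[(e, f)]."""
    target_vocab = {e for es, _ in corpus for e in es}
    # Uniform initialisation: every TL word is equally likely for every SL word.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)    # weighted counts c(e|f)
        total = defaultdict(float)    # normaliser for each source word f
        for es, fs in corpus:
            fs = ["NULL"] + fs        # position 0 holds the empty word
            for e in es:
                z = sum(t[(e, f)] for f in fs)   # denominator of equation (1)
                for f in fs:
                    c = t[(e, f)] / z            # expectation step
                    count[(e, f)] += c
                    total[f] += c
        for (e, f), c in count.items():          # maximization step
            t[(e, f)] = c / total[f]
    return t


def best_alignment(es, fs, t):
    """Most probable source position for each target word (0 = NULL)."""
    fs = ["NULL"] + fs
    return [max(range(len(fs)), key=lambda i: t[(e, fs[i])]) for e in es]


corpus = [
    ("the house".split(), "la casa".split()),
    ("the book".split(), "il libro".split()),
    ("a book".split(), "un libro".split()),
]
t = train_ibm_model1(corpus)
```

Because Model 1 ignores word order, the most probable alignment factorises over the target words, which is why `best_alignment` can choose each link independently.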

IBM Model 2
IBM Model 1 does not incorporate any knowledge about the word order in the target sentence. In contrast, IBM Model 2 has an explicit model for the alignment based on the positions of the input and output words (cf. equation (2)).

\[
a(i \mid j, l_e, l_f) \tag{2}
\]

The alignment probability distribution in (2) models the probability of translating a source word in position $i$ into a target word in position $j$. The model predicts the source word positions conditioned on the generated target word positions. Extending IBM Model 1 with the position-based alignment probability distribution shown in (2), we obtain a new equation for computing the most probable alignment $a$ for a sentence pair $(e, f)$. The equation is shown in (3).

\[
p(a \mid e, f) = \varepsilon \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})\, a(a(j) \mid j, l_e, l_f)}{\sum_{i=0}^{l_f} t(e_j \mid f_i)\, a(i \mid j, l_e, l_f)} \tag{3}
\]

As in Model 1, new lexical translation probabilities are estimated from the weighted counts for lexical translations $c(e \mid f; \mathbf{e}, \mathbf{f})$. In addition to the lexical translations, the position-based probability distribution is computed using the counts for the translation of words in specific positions: $c(i \mid j, l_e, l_f; \mathbf{e}, \mathbf{f})$. As the initial lexical probability distribution, Model 2 uses the lexical probabilities computed by Model 1. The position-based alignment probabilities are initialised as $\frac{1}{l_f + 1}$.

IBM Model 3
IBM Model 3 contains an additional model which expresses the fertility of a source word. It contains the probabilities of translating a source word into one, two or more target words. Artificial fertility probabilities for the Italian word all (= to the) are shown in (4). The probability that all generates two English words is much higher than the probability that it generates only one English word.

\[
\begin{aligned}
n(2 \mid \text{all}) &= 0.8 \\
n(1 \mid \text{all}) &= 0.2
\end{aligned} \tag{4}
\]

The fertility model also allows the insertion of target words that do not have a counterpart in the source sentence. These words are treated as being generated from a special token NULL with fertility $n(\phi \mid \text{NULL})$. Additionally, the fertility model permits that a source word is not translated at all. In other words, it can be dropped. This is expressed by a fertility $n(0 \mid w)$, where $w$ is a source word.

Instead of the alignment probability distribution of Model 2, Model 3 uses a distortion probability distribution $d(j \mid i, l_e, l_f)$ which predicts target word positions based on the source word positions.

For the re-estimation of the model parameters, only the most probable word alignments for a sentence pair $(e, f)$ are used. As the initial lexical probability distribution, the estimates from Model 2 are used. Since the distortion probabilities are not available in the first iteration step, the alignment probabilities estimated by Model 2 are used as the starting distortion probability distribution.

IBM Model 4
IBM Model 4 introduces a relative distortion model, which is an improvement over the absolute distortion model of IBM Model 3. The absolute distortion model does not do well on long source and target sentences. The movement probabilities for such sentence pairs are sparse and not very realistic [Koehn, 09].

Since the position of a generated target word depends in particular on the position of the word generated for the preceding source word, Model 4 introduces a distortion probability distribution based on the position of the alignment of the previous source word. The distortion model implemented in IBM Model 4 is based on cepts. A cept consists of a source word $f_j$ which is aligned with at least one target word. The center of a cept $i$ (denoted by $\odot_i$) is defined as the ceiling of the average of the word positions. The relative distortion $d_1$ of a target word $e_j$ in position $j$, which is also the starting position of a cept $i$, is defined as shown in (5).

\[
d_1(j - \odot_{i-1}) \tag{5}
\]

If a target word $e_j$ is not the start element of a cept, its relative distortion is defined as shown in (6). With the term $\pi_{i,k-1}$, we refer to the position of the preceding target word in the cept which $e_j$ belongs to.

\[
d_{>1}(j - \pi_{i,k-1}) \tag{6}
\]

The computed relative distortion values $d_1$ and $d_{>1}$ express the movement of a target word $e_j$ depending on the position of the preceding target word $e_{j-1}$.

The training of the model starts with the estimates of Model 3 as the initial model parameters. As in Model 3, the most probable alignments are computed, from which the counts for the parameter re-estimation are gathered.
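To make the cept-based definitions in (5) and (6) concrete, the following sketch computes the offsets that the two distortion distributions are conditioned on. The function name and the input format are hypothetical, and the center of the (non-existent) cept before the first one is assumed to be 0.

```python
import math


def relative_distortions(cepts):
    """cepts: one sorted list of target positions per source word, in source order.

    Returns (target_position, kind, offset) triples, where kind is "d1" for the
    first word of a cept and "d>1" for every following word of the same cept.
    """
    result = []
    prev_center = 0            # assumption: center of the cept before the first
    for positions in cepts:
        if not positions:      # a source word with fertility 0 forms no cept
            continue
        for k, j in enumerate(positions):
            if k == 0:
                result.append((j, "d1", j - prev_center))         # cf. (5)
            else:
                result.append((j, "d>1", j - positions[k - 1]))   # cf. (6)
        # center of the cept: ceiling of the average target position
        prev_center = math.ceil(sum(positions) / len(positions))
    return result


offsets = relative_distortions([[1], [2, 3, 4], [6], [5]])
```

In this toy example, the third source word jumps forward by three positions relative to the center of the preceding cept, and the fourth moves back by one.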

GIZA++
The basis for the presented work is the base word alignment computed by the system GIZA++, developed by [Och & Ney, 03]. It is a combination of Model 1, an HMM (Hidden Markov alignment model) $p_{HMM}$, shown in (7), and Model 4 $p_4$.

\[
p_{HMM}(f, a \mid e) = p(B_0 \mid B_1^I) \cdot \prod_{i=1}^{I} p(B_i \mid B_{i-1}, e_i) \cdot \prod_{i=0}^{I} \prod_{j \in B_i} p(f_j \mid e_i) \tag{7}
\]

In the HMM, inverted alignments $B_0^I$ are used to represent the alignment $a_1^J$. They represent the mapping from a TL word to an SL word. $B_i$ is a partition of the SL sentence marking the word (sequence) of the SL. The alignments with empty words are modeled by the probability distribution $p(B_0 \mid B_1^I)$, where the set $B_0$ consists of all positions of SL words which are aligned with the empty word. $p(B_i \mid B_{i-1}, e_i)$ expresses the probability of the SL word (sequence) $B_i$ given the translation of the preceding SL word (sequence) $B_{i-1}$ and a target word $e_i$.

GIZA++ combines Model 1, the HMM and Model 4. First, the parameters for Model 1 are computed. They serve as the initial model parameters for the HMM. The estimates of the HMM are finally used in Model 4 for deriving the final model parameters.

To allow n-to-m alignments, alignment symmetrization is carried out. The word alignment is computed in both directions. In the next step, the two alignments are combined to compute the output alignment. In GIZA++, the intersection of the alignments is computed first, i.e. the links which are part of both alignments are taken. These links are considered to be very reliable since they are found in both directions. After these links are identified, the alignments for the neighbouring words are computed using the union of the two alignments (refined symmetrization) [Och & Ney, 03][Koehn et al., 03].

In this work, GIZA++ has been applied to the English-Italian parallel corpus, producing the baseline word alignment which has been partially improved (cf. chapter 5). [Pianta & Bentivogli, 04] evaluated the statistical word alignment computed by GIZA++ for Italian and English. They used a corpus consisting of 25,000 sentence pairs. Table 4 shows the evaluation results.

As a symmetrization method, [Pianta & Bentivogli, 04] used the intersection of the alignments computed for English → Italian and Italian → English. The reported results of the word alignment evaluation show that the GIZA++ word alignment for English-Italian leaves some room for improvement.


Alignment      Precision   Recall
IT → EN        73.4        55.2
Intersection   95.2        38.8

Table 4: Evaluation of GIZA++ word alignment for English and Italian
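The refined symmetrization described in section 4.1 (start from the intersection, then add links for neighbouring words from the union) can be sketched as follows. This is a simplified variant written for illustration only; the function name and the exact acceptance condition are assumptions and do not reproduce the heuristics of a particular toolkit.

```python
def symmetrize(e2f, f2e):
    """e2f, f2e: sets of (e_index, f_index) links from the two alignment directions."""
    union = e2f | f2e
    alignment = set(e2f & f2e)     # start from the reliable intersection
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:                   # repeat until no further link can be added
        added = False
        for e, f in sorted(alignment):
            for de, df in neighbours:
                cand = (e + de, f + df)
                if cand not in union or cand in alignment:
                    continue
                e_free = all(link[0] != cand[0] for link in alignment)
                f_free = all(link[1] != cand[1] for link in alignment)
                if e_free or f_free:   # at least one of the two words is unaligned
                    alignment.add(cand)
                    added = True
    return alignment


links = symmetrize({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1), (1, 2)})
```

A link that appears in only one direction and is not adjacent to an accepted link stays out of the result, which keeps the precision of the intersection while recovering part of the recall of the union.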

4.2 Phrase-based SMT

SMT belongs to the group of word-based machine translation systems. This means that the input sentence to be translated does not undergo any (syntactic or semantic) analysis, but is translated word by word. A large bilingual dictionary is needed to carry out word-by-word translation. There are many cases in which word-by-word translation fails. One word in the SL does not always correspond to exactly one word in the TL, which also holds for the opposite translation direction. This leads to the assumption that, instead of words, phrases (word sequences) should be translated as one translation unit. These phrases do not necessarily coincide with linguistic phrases. For example, the Italian word sequence Io sono (subject pronoun + sentence predicate) can be a phrase which is translated as one unit into the English phrase I am. To carry out this type of translation, we need translation probabilities for phrase pairs as shown in table 5.

Translation       Probability p(e|f)
i am              0.80
i was             0.10
i have been       0.05
myself am also    0.03
we are            0.02

Table 5: Example phrase translation probabilities for io sono

When TL phrases are generated, they have to be reordered so that they appear in the correct order in the generated sentence. This is modelled by a reordering model. Instead of learning reordering probabilities from the data, a cost function is applied. The cost function expresses how expensive the movement of a phrase is.

In the following, the details of phrase-based SMT are described with respect to its implementation in the open source SMT system Moses [Koehn et al., 07].

Phrase translation table
The first step in obtaining translation phrases is the word alignment of parallel sentences. In Moses, the word alignment tool GIZA++ (cf. section 4.1) is used. GIZA++ allows one-to-many word alignment, where at most one TL word can be aligned with each SL word. To account for this flaw, Moses expands the word alignment by aligning the words in both directions. The result of the bidirectional alignment is a many-to-many word alignment of the sentence pair. The two alignments can be combined in several ways: they can be intersected, or their union can be built. In Moses, these two methods are combined. First, the intersection of the bidirectional alignments is computed. In the next step, additional alignment points are chosen heuristically from the alignment union.

Once the word alignment is given, translation phrases can be derived. The phrases must be consistent with the word alignment, which means that the words of a phrase pair are only aligned with words within these phrases and not with words outside.

After the phrase pairs are collected, their translation probability is estimated by relative frequency as shown in (8).

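\[
\phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}_i} \mathrm{count}(\bar{f}_i, \bar{e})} \tag{8}
\]

The probability of a phrase $\bar{f}$ given a phrase $\bar{e}$ is the count of how often the two phrases occur together divided by the total number of occurrences of the phrase $\bar{e}$ in the extracted phrase pairs.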
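The consistency condition and the relative frequency estimate in (8) can be illustrated with a small sketch. The function names and the span representation (inclusive index pairs) are assumptions made for this example; it is not the phrase extraction code of Moses.

```python
from collections import Counter


def is_consistent(alignment, e_span, f_span):
    """True if no alignment link leaves the phrase pair and at least one lies inside.

    alignment: set of (e_index, f_index) links; spans are inclusive (first, last).
    """
    (e1, e2), (f1, f2) = e_span, f_span
    inside = False
    for e, f in alignment:
        e_in = e1 <= e <= e2
        f_in = f1 <= f <= f2
        if e_in != f_in:       # link crosses the phrase boundary
            return False
        inside = inside or e_in
    return inside


def phrase_translation_probs(phrase_pairs):
    """phrase_pairs: iterable of extracted (f_phrase, e_phrase) pairs; cf. (8)."""
    pair_counts = Counter(phrase_pairs)
    e_totals = Counter()
    for (f, e), c in pair_counts.items():
        e_totals[e] += c       # sum over all f phrases seen with this e phrase
    return {(f, e): c / e_totals[e] for (f, e), c in pair_counts.items()}


pairs = [("io sono", "i am")] * 4 + [("sono", "i am")]
phi = phrase_translation_probs(pairs)
```

With four of the five extracted pairs for i am being io sono, the estimate for that pair is 0.8 and the estimates for a fixed English phrase sum to one.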

Reordering models
The reordering model in Moses is based on the reordering of a phrase relative to the previous phrase. We define $start_i$ as the start position of phrase $i$ and $end_i$ as the position of the last word of that phrase, so that $end_{i-1}$ refers to the preceding phrase. The reordering distance is computed as shown in (9).

\[
x = start_i - end_{i-1} - 1 \tag{9}
\]

The cost function in (10) is applied to the computed reordering distance, where $\alpha \in [0, 1]$.

\[
d(x) = \alpha^{|x|} \tag{10}
\]

Generally, this reordering model punishes any movement. This works fine for language pairs with similar syntax, but it leads to bad translations for language pairs which differ significantly with respect to word order. Although the language models should account for the different word order in SL and TL sentences, they are limited as they consider only short word sequences. For this reason, phrase-based SMT uses an additional reordering model: the lexicalized reordering model. It models the orientation of an extracted phrase pair. The orientation specifies the position of the TL phrase. It can be monotone, which means that its position is the same as the position of the SL phrase. Furthermore, it can be swap, indicating that the SL and the TL phrases are swapped. Finally, the phrases can be discontinuous, i.e. interrupted by other phrases.
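The distance-based cost in (9) and (10) amounts to a few arithmetic steps. The sketch below uses a hypothetical function name and an arbitrary value of 0.5 for α.

```python
def reordering_cost(start_i, end_prev, alpha=0.5):
    """Distance-based reordering cost d(x) = alpha^|x|, with x as defined in (9).

    start_i:  start position of the current phrase
    end_prev: position of the last word of the preceding phrase
    """
    x = start_i - end_prev - 1
    return alpha ** abs(x)


monotone = reordering_cost(4, 3)      # x = 0, the phrase directly follows
forward_jump = reordering_cost(6, 3)  # x = 2, two words are skipped
backward = reordering_cost(1, 3)      # x = -3, a move back in the sentence
```

A monotone continuation has distance 0 and therefore cost 1, whereas every skipped word multiplies the cost by α regardless of the direction of the movement.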

Language model
Different word order in different languages poses a problem for statistical machine translation. A language model which is built from a large target language text should account for this. It consists of automatically computed n-grams which express the probability of a target word $e_j$ given the $n-1$ previously generated target words. The computation of the probability of a target sentence $e = e_1, ..., e_l$ with a trigram language model is shown in (11).

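\[
\begin{aligned}
p(e) &= p(e_1, e_2, \ldots, e_l) \\
     &= p(e_1)\, p(e_2 \mid e_1) \cdots p(e_l \mid e_1, e_2, \ldots, e_{l-1}) \\
     &\simeq p(e_1)\, p(e_2 \mid e_1) \cdots p(e_l \mid e_{l-2}, e_{l-1})
\end{aligned} \tag{11}
\]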

The model parameters are computed using the counts of the word sequences as shown in (12).

\[
p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\sum_{w} \mathrm{count}(w_1, w_2, w)} \tag{12}
\]
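The estimate in (12) can be computed directly from trigram counts. The following is a minimal sketch without smoothing; the function name, the sentence boundary symbols `<s>` and `</s>` and the toy sentences are assumptions made for the example.

```python
from collections import Counter


def train_trigram_lm(sentences):
    """Return a function prob(w3, w1, w2) estimated by relative frequency, cf. (12)."""
    trigrams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            trigrams[(w1, w2, w3)] += 1
            histories[(w1, w2)] += 1      # equals the sum over w of count(w1, w2, w)

    def prob(w3, w1, w2):
        if histories[(w1, w2)] == 0:      # unseen history: no estimate available
            return 0.0
        return trigrams[(w1, w2, w3)] / histories[(w1, w2)]

    return prob


prob = train_trigram_lm([
    "i am here".split(),
    "i am late".split(),
    "i was here".split(),
])
```

Unseen trigrams receive probability 0 here, which is why language models used in practice add smoothing on top of these counts.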

Log-Linear Model
The translation model in phrase-based SMT uses the translation table $\phi(\bar{f} \mid \bar{e})$, the reordering model $d$ and the language model $p_{LM}(e)$. The models are combined in a log-linear model shown in (13).

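\[
e_{best} = \operatorname*{argmax}_{e} \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)^{\lambda_\phi}\, d(start_i - end_{i-1} - 1)^{\lambda_d} \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1})^{\lambda_{LM}} \tag{13}
\]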

The different models used in phrase-based translation are weighted by $\lambda$. The weights for the translation model $\lambda_\phi$, the reordering model $\lambda_d$ and the language model $\lambda_{LM}$ are learned from the bilingual data in order to maximize the likelihood of the training data.
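In practice, the product in (13) is usually evaluated in log space, where the weights become factors of a sum. The sketch below shows this for precomputed feature values; the function names, the dictionary keys and the example numbers are hypothetical.

```python
import math


def log_linear_score(phrase_probs, reorder_costs, lm_probs, weights):
    """Weighted sum of log feature values; the log of the product in (13)."""
    return (weights["phi"] * sum(math.log(p) for p in phrase_probs)
            + weights["d"] * sum(math.log(d) for d in reorder_costs)
            + weights["lm"] * sum(math.log(p) for p in lm_probs))


def best_hypothesis(hypotheses, weights):
    """hypotheses: dict mapping a translation to its three lists of feature values."""
    return max(hypotheses, key=lambda h: log_linear_score(*hypotheses[h], weights))


weights = {"phi": 1.0, "d": 1.0, "lm": 1.0}
hypotheses = {
    "i am": ([0.8], [1.0], [0.2, 0.5]),
    "we are": ([0.02], [1.0], [0.2, 0.5]),
}
best = best_hypothesis(hypotheses, weights)
```

Since the logarithm is monotone, the hypothesis with the highest sum of weighted log values is also the one that maximizes the product.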


5 Word alignment of English and Italian verb phrases

This chapter describes a method for improving the base word alignment with respect to the problem of null subject pronouns in Italian and obligatory subject pronouns in English. Since the English subject pronoun does not necessarily have a counterpart in Italian, it is often aligned with incorrect words in the parallel Italian sentence.

I present a rule-based method for the correction of the base alignment of English subject pronouns. Since the alignment of the subject pronouns depends on the alignment of the sentence predicate, rules have been developed which define the alignment not only of English subject pronouns, but also of the entire English verb phrases (VP) which belong to a subject pronoun. In the following, the term verb phrase is used for the combination of the (null) subject pronoun, the negation and the verbal elements of the VP.

After a short motivation for the improvement of the base alignment in the following section, I describe the data that I worked with (cf. section 5.2). In section 5.3, the algorithm used to compute the VP alignment is presented. The rules based on part of speech tags, which have been developed and applied to the base word alignment, the English parses and the tagged Italian sentences, are described in section 5.4. The evaluation results of the improved alignment are discussed in section 5.5. Some extensions of the proposed method are shown in section 5.6.

5.1 Motivation

Since the pronominal subject in Italian can be omitted, the English subject pronoun is often aligned with arbitrary Italian words. These include, for example, conjunctions and punctuation. Knowing the word category of these Italian words, rules can be applied which prohibit the alignment of the English subject pronoun with these words. The rules are based on the PoS of the words whose alignment is to be computed.

If an Italian subject pronoun is omitted, the information about the subject is provided by a finite verb which is aligned with the English finite verb (cf. section 2.2.2). What I would like to achieve is the alignment of the English subject pronoun with the Italian verb that corresponds to the English finite verb. This is the reason why not only the English subjects are examined, but also all verbal elements of VPs. In the following, the term VP denotes the part of a VP which contains only verbal elements and a negation.

Since the sequence of verbs in an English VP can be interrupted by adverbs or embedded clauses, parse trees of the English input are used to identify English VPs (verbal elements and negation) correctly. The Italian input has been PoS tagged to provide information about word categories. Since a parser for Italian was not available, the Italian VPs are defined as PoS sequences. For each sentence pair, the English parse tree is searched to find clauses with a pronominal subject. The tagged Italian sentence is searched in order to find all Italian VPs. Using the base alignment of the elements of an English phrase which has a pronominal subject, the parallel Italian VP is identified. The alignment rules compute the alignment for the PoS sequences of the parallel phrases, whereby only the PoS which mark verbs, negation and personal pronouns are taken into account. The rule-based VP alignment is integrated into the base alignment of the sentence so that the base alignments of the phrase elements are removed.

I assume that every English VP with a pronominal subject has a parallel Italian VP (cf. (A1) in section 1.2). This assumption is made to limit the number of PoS sequences for which the alignment rules are defined (cf. (A3) in section 1.2). Furthermore, I assume that the base alignment is correct enough to allow for the identification of parallel English and Italian VPs (cf. (A2) in section 1.2). The assumptions hold in many cases but they also lead to problems which will be shown in section 5.5.2.

In the following, the algorithm for the application of the alignment rules is presented (cf. section 5.3), as well as the rules based on the word categories (expressed by PoS tags) of English and Italian words (cf. section 5.4).

5.2 Data preparation

The alignment rules have been developed and applied on a reduced version of the Europarl corpus [Koehn, 05] consisting of 749,646 parallel sentences.

Since the alignment rules do not operate on the word level but on the PoS level, it was necessary to preprocess the parallel corpus. The English sentences have been parsed in order to simplify the search for pronominal subjects and VPs. Since the nodes of the English parse trees are underspecified with respect to the grammatical function of phrases, I wrote a program which enriches the relevant nodes with their grammatical function. Since a parser for Italian was not available, the Italian sentences have been PoS tagged in order to obtain information about the word categories.

In the following, I describe the steps of the data preparation process.

5.2.1 English

English sentences have been parsed with the generative parser [Charniak, 00]. The parse trees make it possible to identify the subclauses of the input sentence, the subjects and the VPs. The parser also assigns each word its part of speech tag,¹² which is needed to match the conditions in the alignment rules. An example parse tree is shown in (52).

(52) "I would like your advice about Rule 143 concerning inadmissibility."

     (TOP
       (S (NP (PRP I))
          (VP (MD would)
              (VP (VB like)
                  (NP (NP (PRP your)
                          (NN advice))
                      (PP (IN about)
                          (NP (NNP Rule)
                              (CD 143)))
                      (VP (VBG concerning)
                          (NP (NN inadmissibility))))))
          (. .)))

¹² English PoS are listed in appendix B.

The NP nodes in the parse tree in (52) are not specified with respect to their grammatical functions. To determine whether an NP is a subject or an object, the context has to be taken into account. The assumption that the first NP under S (representing the topic position) is a subject does not always hold (cf. the parse tree in (53)), which makes the search for a (pronominal) subject more complicated.

(53) "This makes it necessary to also take account of the ways in which materials and packaging are affected by cold of this kind."

     (TOP
       (S
         (NP1 (DT This))
         (VP1 (VBZ makes)
           (S
             (NP2 (PRP it))
             (ADJP (JJ necessary)
               (S
                 (VP2 (TO to)
                   (ADVP (RB also))
                   (VP (VB take)
                     (NP ...
         (. .)))

NP2 in (53) is actually an object of the verb in the preceding subclause (with the sentence predicate makes) and a subject of VP2. Thus, the underspecification of NP nodes requires a context check (father and sister nodes) in order to identify the subject of a VP.

The underspecification of NPs is not the only problem for the correct identification of a subject and its VPs. There are verbs which subcategorize an infinitival verb phrase (to-infinitive), for example I would like [to say]_XCOMP (cf. examples (9b) - (9d) in section 2.2.1). The extracted VP which belongs to a pronominal subject should also contain the subcategorized infinitive. To-infinitives can be embedded in various nodes, for example in VP or ADJP (as in the parse tree in (53)). Since I wanted to handle only to-infinitives which are subcategorized by a verb in a preceding clause (and not, for example, by an adjective as in (53)), it was also necessary to examine the context of to-infinitives (VP nodes) in order to decide whether a to-infinitive should be treated as a part of the finite VP whose alignment is to be computed.

There are two ways to solve the problem of identifying subjects and VPs. One way is a runtime examination of the context of the corresponding nodes; the other is to enrich the parse trees with function tags as a part of the data preprocessing. I chose the second approach, which resulted in a program that enriches English parse trees¹³ and in a relatively simple method for extracting subjects and VPs from a modified parse tree.

The tool which transforms the original Charniak parse trees examines only NP and S nodes and enriches them with a tag expressing their grammatical function. The transformation rules examine the context of the relevant nodes. If the conditions for a specific function tag are met, the original node is enriched with the corresponding function tag.

The transformation rules for NP are given in (54). NP nodes are marked as subject NPs only if the father node is S or SBAR, they are not preceded by a VP (for example [Let_VB]_VP [me_PRP]_NP [say_VB]_VP) and a sister VP is not a to-infinitive (for example [[us_PRP]_NP [to_TO say_VB]_VP]_S). In an interrogative sentence, the finite verb in front of the subject is not embedded in a VP, for example [[Can_MD] [you_PRP]_NP [say_VB]_VP ... ]_S, so that, in this case, the NP is still identified as a subject NP. The conditions for subject NPs are summarized in rule (54a). If these conditions are not fulfilled, the NP is an object (NP rule (54b)). Furthermore, the NP node is identified as an object when the father node is a VP (NP rule (54c)).

(54) a. NP → NP-SUBJ
        if the father node is S or SBAR, and there is no preceding VP under the father node, and if there is a sister VP node, it is not a to-infinitive

     b. NP → NP-OBJ
        if the father node is S or SBAR, and there is a preceding VP or a sister VP node which is a to-infinitive

     c. NP → NP-OBJ
        if the father node is a VP

It was also necessary to examine S nodes to determine whether they consist of a to-infinitive. If this is the case, and the infinitive is subcategorized by a verb in the preceding clause, the S node should be annotated with a function tag that reflects these features, namely S-XCOMP. For example, the examination of the phrase I would like to say, in which the to-infinitive to say is embedded in the category S, should identify the to-infinitive as an infinitive which belongs to the preceding finite verb. If the example phrase is modified to I would like you to say, the to-infinitive should not be treated as a part of the VP [would like]_VP, since its subject is not I. The parses for such sentences are shown in (55) and (56).

¹³ [Blaheta, 2004] developed a function tagger which provides parse trees with function tags annotated to the phrases and words. I was not able to run this tagger, which is why I implemented my own tool for this task. It is important to note that my program enriches only the nodes which are relevant for the presented work.

(55) "I would like once again to wish you ..."

     (S1
       (NP (PRP I))
       (VP (MD would)
         (VP (VB like)
           (S2 (ADVP (RB once)
                     (RB again))
             (VP (TO to)
               (VP (VB wish)
                 (NP (PRP you))
                 ...
       (. .)))

(56) "I would therefore once more ask you to ensure ..."

     (TOP
       (S3
         (NP (PRP I))
         (VP (MD would)
           (ADVP (RB therefore))
           (VP (ADVP (RB once)
                     (JJR more))
             (VB ask)
             (NP (PRP you))
             (S4
               (VP (TO to)
                 (VP (VB ensure)
                   ...
         (. .)))

The transformation rules for S nodes are given in (57). The condition for applying these transformation rules is that an S or SBAR node is embedded in a VP. If the node has a preceding sister node NP (as S4 in (56)), it should be marked as S-OBJXCOMP, expressing that the to-infinitive does not have the same subject as the VP in the superordinate clause (cf. S rule (57b)). If this is not the case, the node should be marked as S-XCOMP, expressing that the to-infinitive belongs to the superordinate VP (cf. S2 in (55)). This is achieved by the S rule (57a).


(57) a. S, SBAR → S-XCOMP, SBAR-XCOMP
        if the father node is VP and it is not preceded by a sister node NP

     b. S, SBAR → S-OBJXCOMP, SBAR-OBJXCOMP
        if the father node is VP and it is preceded by a sister node NP

Applying the transformation rules in (54) and (57) to the parse trees in (55) and (56) results in the modified parse trees shown in (58) and (59).

(58) "I would like once again to wish you ..."

     (S1
       (NP-SUBJ (PRP I))
       (VP (MD would)
         (VP (VB like)
           (S2-XCOMP (ADVP (RB once)
                           (RB again))
             (VP (TO to)
               (VP (VB wish)
                 (NP-OBJ (PRP you))
                 ...
       (. .)))

(59) "I would therefore once more ask you to ensure ..."

     (TOP
       (S3
         (NP-SUBJ (PRP I))
         (VP (MD would)
           (ADVP (RB therefore))
           (VP (ADVP (RB once)
                     (JJR more))
             (VB ask)
             (NP-OBJ (PRP you))
             (S4-OBJXCOMP
               (VP (TO to)
                 (VP (VB ensure)
                   ...
         (. .)))

Modified parse trees such as (58) and (59) simplify the search for English subjects and the corresponding VPs. With the NP and S nodes enriched in this way, we can search directly for the nodes that correspond to the phrases we are interested in.
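The transformation rules in (54) and (57) can be sketched as a recursive pass over a bracketed parse tree. This is not the program used in this work but a minimal illustration: the nested-list tree representation and all function names are assumptions, the node indices used for reference in the examples (S1, NP2, ...) are left out, and rule (57) is applied exactly as stated, without an additional check that the S node contains a to-infinitive.

```python
import re


def parse(text):
    """Parse a bracketed parse tree into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                              # skip ")"
        return [label] + children

    return node()


def is_to_infinitive(n):
    """A VP whose first child is the infinitive marker (TO to)."""
    return n[0] == "VP" and len(n) > 1 and isinstance(n[1], list) and n[1][0] == "TO"


def enrich(n, parent=None, left=(), right=()):
    """Add function tags to NP and S/SBAR nodes following (54) and (57)."""
    label = n[0]
    if label == "NP" and parent in ("S", "SBAR"):
        vp_before = any(s[0] == "VP" for s in left)
        to_inf = any(is_to_infinitive(s) for s in list(left) + list(right))
        n[0] = "NP-OBJ" if vp_before or to_inf else "NP-SUBJ"    # (54b) / (54a)
    elif label == "NP" and parent == "VP":
        n[0] = "NP-OBJ"                                          # (54c)
    elif label in ("S", "SBAR") and parent == "VP":
        # left sisters are already enriched, so their labels start with "NP"
        np_before = any(s[0].startswith("NP") for s in left)
        n[0] = label + ("-OBJXCOMP" if np_before else "-XCOMP")  # (57b) / (57a)
    kids = [c for c in n[1:] if isinstance(c, list)]
    for i, child in enumerate(kids):
        enrich(child, label, kids[:i], kids[i + 1:])
    return n


def to_string(n):
    return "(" + " ".join([n[0]] + [c if isinstance(c, str) else to_string(c) for c in n[1:]]) + ")"


tree = parse(
    "(TOP (S (NP (PRP I))"
    " (VP (MD would)"
    " (VP (VB ask) (NP (PRP you))"
    " (S (VP (TO to) (VP (VB ensure)))))))"
    " (. .)))"
)
enriched = to_string(enrich(tree))
```

Doing the enrichment once as a preprocessing step keeps the later extraction of subjects and VPs free of context checks, which is the design choice described above.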


5.2.2 Italian

Italian sentences have been tagged with TreeTagger [Schmid, 95], creating an input text consisting of the words with their PoS.14 The PoS-tagged Italian sentence in (60b) corresponds to the sentence in (60a). The words are enriched with their PoS; "#" is the delimiter between a word and its PoS.

(60) a. Non  credo    però  che   la   relazione  arrivi  tardi.
        not  believe  but   that  the  report     comes   late.
        'But I do not believe that the report comes too late.'

b. Non#NEG credo#VER:fin però#ADV che#CHE la#ART

relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT

On the basis of the PoS, we can identify the verbs (VER:fin, VER:infi, VER:ppast, VER2:fin, VER2:geru, etc.), negation (NEG), and subject pronouns (PRO:pers, PRO:demo) in the tagged Italian input. For example, the PoS VER:fin bears the information that credo is a finite verb.
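Reading the tagged input can be sketched as follows. The function names are hypothetical, and treating every tag that starts with VER or AUX as a verb tag is an assumption based on the tags that occur in the examples of this chapter; the full tagset is listed in appendix A.

```python
def parse_tagged(line):
    """Split a 'word#PoS' sentence into (position, word, PoS) triples."""
    tokens = []
    for position, item in enumerate(line.split()):
        # Split at the last '#', so a word containing '#' stays intact.
        word, _, pos = item.rpartition("#")
        tokens.append((position, word, pos))
    return tokens


def is_verb(pos):
    # Assumption: verb tags start with VER (incl. VER2) or AUX.
    return pos.startswith(("VER", "AUX"))


def is_finite(pos):
    return is_verb(pos) and ":fin" in pos


def is_negation(pos):
    return pos == "NEG"


def is_subject_pronoun(pos):
    return pos in ("PRO:pers", "PRO:demo")


sentence = ("Non#NEG credo#VER:fin però#ADV che#CHE la#ART "
            "relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT")
tokens = parse_tagged(sentence)
finite_verbs = [word for _, word, pos in tokens if is_finite(pos)]
```

For the sentence in (60b), this sketch finds the two finite verbs credo and arrivi, the negation Non, and no subject pronoun.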

5.2.3 Data preprocessing errors

The tagger for Italian which was used in this work was trained using the Italian morphological lexicon MorphIt [Zanchetta & Baroni, 05] and a set of about 100,000 manually tagged words from the newspaper corpus Repubblica [Baroni et al., 04]. The accuracy of the tagger can be compared with the accuracy of the Italian TreeTagger [Schmid, Baroni et al., 2007] reported for the Part of Speech Tagging Task at EVALITA 2007. The TreeTagger reaches an accuracy of 97%.15

However, the examination of the Italian tagged input revealed that some words are often tagged incorrectly. Example (61a) shows the Italian counterpart of the English sentence in (52).16

(61) a. "Gradirei avere il suo parere riguardo all' articolo 143 sull' inammissibilità."

b. Gradirei#VER:fin avere#VER:infi il#ART suo#DET:poss parere#NOUN

riguardo#VER:fin all#NOUN '#PUN articolo#NOUN 143#NUM

sull#NOUN '#PUN inammissibilità#ADJ .#SENT

The example sentence in (61b) contains a relatively large number of incorrectly tagged words. For example, the ambiguous word form riguardo, which can be a noun (= consideration) or a verb (= I concern), is tagged in (61b) as a finite verb, which is not correct in this context. One of the common tagging errors concerns prepositions merged with a definite article. When these word forms appear in front of a word which begins with a vowel, they end with an apostrophe: sull', all'. The tagger does not recognize these word forms as merged forms of an article and a preposition, which would receive the tag ARTPRE, but as words of an arbitrary category (for example, as a noun (NOUN) or a finite verb (VER:fin)). I corrected the article tag errors manually to reduce the number of false verb tags, since these lead to an erroneous identification of the Italian VPs.

14 Italian PoS are listed in appendix A.
15 The comparison of the evaluation results was suggested by M. Baroni (pers. comm.).
16 These examples are taken from the parallel corpus Europarl.

5.3 Applying alignment rules

The program for base alignment improvement expects a set of parallel sentences of Italian (with PoS) and English (as a parse tree) as input. Details about the corpus preparation have already been described in section 5.2. The parallel sentences are automatically word aligned with GIZA++ [Och & Ney, 03]. This base word alignment is the basis for the rule-based VP alignment.

 1: function correct_align(en_parse, it_tagged, base_align)
 2:     new_align                                    ▷ New alignment
 3:     for e in en_parse do                         ▷ English sentence
 4:         e_subj_verb ← search_subj_verb(e)
 5:         phrase_pair ← search_it_vp(e_subj_verb, it_tagged, base_align)
 6:         new_align ← align(phrase_pair, base_align)
 7:         pos_pattern ← derive_pos_pattern(new_align)
 8:     end for
 9:     return new_align
10: end function

Figure 1: Main program: correct_align

The main program is shown in figure 1. Several steps are carried out for each sentence pair, beginning with a check whether the English sentence e contains a pronominal subject. After identifying the English pronominal subject and its verbs (line 4 in figure 1), the program looks for the Italian VP with which the English words are aligned (line 5 in figure 1). The procedure which fulfils this task is described in section 5.3.2. Its output is a phrase pair containing words enriched with information about the word category (PoS) and the position of the word in the sentence.

Having a phrase pair whose alignment should be computed, we can now call the function which computes the alignment of the given phrase pair by applying PoS-based alignment rules (line 6 in figure 1). The program also derives PoS and PoS patterns of the parallel phrases (line 7 in figure 1).17 The main program returns the computed VP alignment, which is then integrated into the word alignment of the sentence pair.

17 The counts of the PoS occurrences could be used to compute the probability of translating an English PoS ep_i into an Italian PoS ip_j. The derived PoS patterns could be used to check the correctness of the found PoS patterns. Furthermore, the PoS translation pairs could be used to examine which tenses are mostly used in the given language pair. They would also allow the examination of the tense similarity: how often the same tense is used in the parallel sentence pair, or how often the tense and voice diverge.


A graphical illustration of the complete system is shown in figure 2. Each box represents one processing step. The processing steps are explained in detail in the following sections.

[Figure: flow diagram. Preprocessing boxes: IT–EN parallel corpus; Charniak parser (EN); TreeTagger (IT); Enriching parse trees; Word alignment with GIZA++. Alignment improvement system boxes: Searching sentences with pronominal subject; Searching IT VPs; Identification of VP pairs; Aligning VP elements; Merging base alignment with VP alignment.]

Figure 2: System components

After the VP alignments are produced, they are merged with the base word alignment. In the resulting alignment, the pronominal subjects and VPs in both languages have only the alignments computed by the program for the VP alignment. The baseline alignments for these words are deleted.

The function align(phrase_pair, base_align) (line 6 in figure 1), which computes the alignment of the VP pairs, is shown in figure 3. The functions for the alignment of different English word classes (align_subj(e, it), align_vfin(e, it), etc.) implement the alignment rules described in section 5.4. For the given English word, compatible Italian words are identified. The examination of the alignment takes only PoS into account. If there is no appropriate Italian word (with an appropriate PoS), the given English word stays unaligned.

function align(phrase_pair, base_align)
    new_align                                        ▷ Computed alignment
    en ← english_words(phrase_pair)
    it ← italian_words(phrase_pair)
    for e in en do
        if subject(e) = True then
            new_align.append(align_subj(e, it))
        else if finverb(e) = True then
            new_align.append(align_vfin(e, it))
        else if infpartger(e) = True then
            new_align.append(align_infpartger(e, it))
        else if negation(e) = True then
            new_align.append(align_neg(e, it))
        else if infparticle(e) = True then
            new_align.append(align_infpart(e, it))
        end if
    end for
    return new_align
end function

Figure 3: Alignment check and improvement
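The merging step described earlier in this section (deleting the baseline links of the subject and VP words and keeping only the rule-based links for them) can be sketched as follows. The representation of an alignment as a set of (English position, Italian position) pairs and the function name merge_alignments are assumptions for illustration.

```python
def merge_alignments(base_align, vp_align, en_positions, it_positions):
    """Merge rule-based VP links into the base word alignment.

    base_align, vp_align: sets of (en_position, it_position) links.
    en_positions, it_positions: positions of the subject pronoun and VP
    words whose baseline links are to be replaced.
    """
    # Keep only baseline links that touch neither an English nor an
    # Italian word of the processed phrase pair.
    kept = {(e, i) for (e, i) in base_align
            if e not in en_positions and i not in it_positions}
    return kept | set(vp_align)


# Positions as in "if you wish" / "se lo desidera" (figures 6 and 7).
base = {(9, 8), (10, 9), (11, 10)}
vp_links = {(10, 10), (11, 10)}
merged = merge_alignments(base, vp_links, en_positions={10, 11},
                          it_positions={10})
```

In this example the baseline link between you and lo is dropped, so lo remains unaligned, while the link between if and se is untouched because neither word belongs to the processed phrases.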

5.3.1 Identification of Italian VPs

Since no Italian parser was available, extracting correct Italian verb phrases using the base word alignment posed a great problem. In this work, I assumed that the base alignment is sufficiently correct to make it possible to find the Italian phrase corresponding to a given English phrase. Unfortunately, the Italian phrases found in this way were often incomplete, i.e. some VP elements were missing. Therefore, I identify all Italian VPs before the search for a matching Italian VP is carried out, which is described in the following section.

The identification of Italian VPs is based on PoS. I defined PoS which mark the start of a VP, and PoS of other verbal elements which can be part of a VP. An Italian sentence is searched until a PoS is found that can start a VP. From this sentence position, the search for other elements continues until the sentence end or another VP-starting PoS is reached. The search function returns the Italian word sequences that contain a personal pronoun, negation and verbal elements of a VP. Other VP elements are ignored.

This method finds not only finite VPs starting with a pronoun, finite verb or negation, but also infinitival VPs, and gerundive phrases which often consist of only one gerundive.


(62) a. Perché  non  esistono  istruzioni    da  seguire   in  caso  di  incendio?
        why     not  exist     instructions  to  continue  in  case  of  fire?
        'Why are there no instructions in case of fire?'

b. Perché#WH non#NEG esistono#VER:fin istruzioni#NOUN da#PRE

seguire#VER:infi in#PRE caso#NOUN di#PRE incendio#NOUN ?#SENT

The sentence in (62b) consists of two VPs. The implemented method for the identification of Italian VPs finds the following verb phrases:

1. non#NEG esistono#VER:fin

2. da#PRE seguire#VER:infi
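The following is a simplified sketch of such a PoS-based search, not the implementation used in this work. The concrete start conditions are assumptions chosen so that the sketch reproduces the examples of this section: a pronoun or a preposition directly followed by an infinitive always opens a new VP, a negation or finite verb opens a new VP only if the current VP already contains a verb, and all non-verbal words (including clitics) are skipped.

```python
PRONOUN_TAGS = {"PRO:pers", "PRO:demo"}


def is_verb(pos):
    # Assumption: verb tags start with VER (incl. VER2) or AUX.
    return pos.startswith(("VER", "AUX"))


def find_italian_vps(tokens):
    """tokens: list of (word, PoS) pairs; returns a list of VPs,
    each a list of (word, PoS) pairs."""
    vps, current, has_verb = [], None, False
    for idx, (word, pos) in enumerate(tokens):
        next_pos = tokens[idx + 1][1] if idx + 1 < len(tokens) else ""
        pre_inf = pos == "PRE" and ":infi" in next_pos
        if pos in PRONOUN_TAGS or pre_inf:
            starts = True
        elif pos == "NEG" or (is_verb(pos) and ":fin" in pos):
            starts = current is None or has_verb
        elif is_verb(pos):               # participle, infinitive, gerundive
            starts = current is None
        else:
            continue                     # not a VP element: ignored
        if starts:
            current, has_verb = [], False
            vps.append(current)
        current.append((word, pos))
        has_verb = has_verb or is_verb(pos)
    return vps


tagged = ("Perché#WH non#NEG esistono#VER:fin istruzioni#NOUN da#PRE "
          "seguire#VER:infi in#PRE caso#NOUN di#PRE incendio#NOUN ?#SENT")
tokens = [tuple(item.rsplit("#", 1)) for item in tagged.split()]
vps = find_italian_vps(tokens)
# vps contains the two phrases listed above: "non esistono" and "da seguire"
```

Like the rules described in this section, such a heuristic depends entirely on the tags: a word wrongly tagged as a finite verb opens a spurious VP.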

Non-finite phrases like [da seguire]XCOMP are extracted as independent phrases since they often correspond to complete English clauses. An example of such a case is shown in figure 4. The English VP [would ask]VP does not include the to-infinitive [to request]XCOMP since the to-infinitive and the finite verb phrase do not have the same subject. For this reason, it would be wrong to handle the Italian infinitive [di chiedere]VP as a part of the Italian VP [prego]VP, which corresponds to the English VP [would ask]VP.

It is also possible to translate a finite English sentence as an Italian infinitival clause. This is an additional reason why I handle Italian infinitives as separate VPs.

[Figure: alignment diagram linking the English tokens I5/PRP would6/MD ask6/VB you7/PRP to8/TO request9/VB with the Italian tokens la5/CLI prego6/VER:fin di7/PRE chiedere6/VER:infi]

Figure 4: Alignment of I would ask you to request and la prego di chiedere

These rather simple rules for the detection of complete Italian VPs do not always provide correct verb phrases. Mistakes are made if the word order is changed, or if a sequence of VP start elements occurs. Furthermore, incorrect tagging also leads to the identification of false Italian VPs, as shown in (63b) (cf. section 5.2.3).


(63) a. "Come  avrete  avuto  modo  di  constatare  il   grande  "baco  del     millennio"  non  si è  realizzato."
         as    have    had    way   to  observe     the  big     bug    of the  millennium  not  was    realized.
        'As you could have seen, the "millennium bug" did not materialize.'

b. come#WH avrete#AUX:fin avuto#VER:ppast modo#NOUN di#PRE

constatare#VER:infi il#ART grande#ADJ "#PUN baco#VER:fin del#ARTPRE

millennio#NOUN "#PUN non#NEG si#CLI è#AUX:fin

materializzato#VER:ppast .#SENT

The method for the identification of Italian VPs finds the following verb phrases for the sentence (63b):

1. avrete#AUX:fin avuto#VER:ppast

2. di#PRE constatare#VER:infi

3. baco#VER:fin

4. non#NEG è#AUX:fin materializzato#VER:ppast

The VP in 3 (baco#VER:fin) is not correct. Baco (= the bug (noun)) has been assigned the wrong PoS, resulting in the extraction of a false VP. Although the rules for the identification of the Italian VP can lead to false VPs, they provide a relatively good basis for the search for an Italian VP that corresponds to a given English phrase, which is described in the following section.

5.3.2 Identification of the most probable Italian VP

Good VP alignment results can be achieved by applying alignment rules only if the rules are applied to parallel English and Italian VPs. The procedure for searching for the matching Italian VP given an English VP is shown in figure 5. The method for determining the best Italian VP given an English VP is based on a count of alignments between English and Italian words in these phrases. So, I assume that the base alignment is correct on the level of phrase alignment. This means that the best Italian phrase has the most base alignment links for the given English word sequence.

The search function in figure 5 receives as input the English subject and its verbs, a list of Italian VPs extracted from the parallel sentence (as described in the previous section), and the base alignment. For each Italian VP, the number of alignment links between its elements and the English input is computed. The VP with the most alignment links represents the best Italian VP for the English input.


function search_it_vp(en_subj_vp, it_vps, base_align)
    word_pairs ← []                      ▷ EN and IT words which belong to parallel phrases
    en_al ← alignments of EN words
    vp_links ← []                        ▷ Pairs: (IT VP, # links to its elements)
    for it_vp in it_vps do               ▷ Loop over all Italian VPs
        links ← 0                        ▷ # alignments for EN phrase and IT candidate VP
        for (en, it) in en_al do         ▷ Alignment pairs
            if it ∈ it_vp then           ▷ Italian word it is a part of Italian VP it_vp
                links += 1
            end if
        end for
        vp_links.append(it_vp, links)
    end for
    best_vp ← max(vp_links)              ▷ Italian VP with most alignments
    word_pairs.append(elements_of(best_vp))
    return word_pairs
end function

Figure 5: Search for the best Italian VP
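The link counting in figure 5 can be sketched in runnable form as follows. The data representation (positions as integers, VPs as sets of positions, the alignment as a set of pairs) is an assumption for illustration. Returning None when no candidate VP has any link, and preferring the first VP in case of a tie, are also assumptions that figure 5 does not specify.

```python
def search_it_vp(en_positions, it_vps, base_align):
    """Return the Italian VP with the most base alignment links.

    en_positions: set of positions of the English subject and its verbs.
    it_vps: list of sets of Italian word positions (candidate VPs).
    base_align: set of (en_position, it_position) links.
    """
    # Links that start at one of the English words of interest.
    en_links = [(e, i) for (e, i) in base_align if e in en_positions]
    best_vp, best_links = None, 0
    for it_vp in it_vps:
        links = sum(1 for (_, i) in en_links if i in it_vp)
        if links > best_links:           # first VP wins a tie (assumption)
            best_vp, best_links = it_vp, links
    return best_vp


base_align = {(0, 0), (1, 1), (2, 1), (3, 4), (4, 5)}
candidates = [{0, 1}, {4, 5}]
best = search_it_vp({0, 1, 2}, candidates, base_align)
```

Here the first candidate receives three links from the English words at positions 0 to 2 and the second candidate none, so the first one is selected.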

5.4 Alignment rules

The alignment rules define the alignment of the relevant sentence parts in an English and Italian parallel corpus. They are based on an already created alignment (base alignment) and the PoS of the words that are observed. As mentioned in previous sections, these sentence parts are subject pronouns, negation and verbal sentence predicates. Figure 3 in section 5.3 showed the program for alignment improvement. It consists of a loop over the words of the English input phrase. For each word, its word category is derived (on the basis of PoS), and the function which computes the alignment for that word category is called. There are five functions which compute the alignment for five word category groups. If the input word e is:

1. subject, i.e. its PoS is PRP,
   the function align_subj(e, it) is called.

2. finite verb, i.e. its PoS is:

   • AUX : auxiliary
   • MD : modal verb
   • VBZ : 3rd person singular present
   • VBP : non-3rd person singular present
   • VBD : past tense

   the function align_vfin(e, it) is called.

3. infinitive, participle or gerundive, i.e. its PoS is:

   • VB : infinitive
   • VBN : past participle
   • VBG : gerundive

   the function align_infpartger(e, it) is called.

4. negation, i.e. its PoS is RB,18
   the function align_neg(e, it) is called.

5. infinitive particle to, i.e. its PoS is TO,
   the function align_infpart(e, it) is called.
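The category decision in the list above can be sketched as a simple lookup. The function name english_category and the category labels are hypothetical. The set of negation word forms is an assumption as well, since the list of RB forms that were treated as negation is not given here.

```python
FINITE_TAGS = {"AUX", "MD", "VBZ", "VBP", "VBD"}
NONFINITE_TAGS = {"VB", "VBN", "VBG"}
NEGATION_FORMS = {"not", "n't"}   # assumption: forms tagged RB that negate


def english_category(word, pos):
    """Map an English (word, PoS) pair to one of the five categories,
    or None if the word is not handled by the alignment rules."""
    if pos == "PRP":
        return "subject"
    if pos in FINITE_TAGS:
        return "finite_verb"
    if pos in NONFINITE_TAGS:
        return "inf_part_ger"
    if pos == "RB" and word.lower() in NEGATION_FORMS:
        return "negation"          # other adverbs share the tag RB
    if pos == "TO":
        return "inf_particle"
    return None


categories = [english_category(w, p)
              for w, p in [("He", "PRP"), ("does", "AUX"),
                           ("not", "RB"), ("sleep", "VB")]]
```

An adverb such as actually/RB yields None in this sketch, because only the word form distinguishes it from the negation.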

In the following sections, the alignment rules for each word category are presented (cf. sections 5.4.2 - 5.4.6). But before the rules are described, we should examine the syntax of English and Italian VPs. The PoS sequence which occurs in a specific tense is crucial for defining alignment rules. Each rule is applicable only if the context constraints are fulfilled. The context is defined by the PoS of the words of the given phrases.

5.4.1 Syntax of the English and Italian VPs

The VP alignment rules use word categories expressed by PoS to compute the word alignment of parallel VPs. Since the alignment of a specific PoS is not always the same but context-dependent, it is necessary to examine which contexts (PoS sequences) are possible. Having this information, constraints can be defined which limit the word alignment of PoS in a specific PoS context.

In the following, we take a closer look at the composition of English and Italian VPs (PoS sequences).

English

English tenses can be realized by one verb only or by a sequence of verbal elements. Since the alignment rules are based on PoS, we have to know which PoS sequences are possible in English VPs. We start with examples of tenses which have only one verb. In the following examples, only relevant tokens are marked with their PoS.

(64) a. He/PRP sleeps/VBZ.

b. It/PRP is/AUX nice.

c. I/PRP sleep/VBP.

d. He/PRP went/VBD home.

If we would like to negate the sentences in (64), we would get composed VPs containing an auxiliary, a negation and an infinitive.

(65) a. I/He/PRP do/does/did/AUX not/RB sleep/VB.

b. I/He/PRP do/does/did/AUX not/RB have/do/AUX it.

18 Since there is no difference between the tags for negation and other adverbs, the word forms tagged with RB had to be examined to identify the negation.

Constructions with modal verbs are shown in (66).

(66) a. He/PRP will/would/MD (not/RB) sleep/VB.

b. He/PRP will/would/MD (not/RB) have/do/AUX it.

c. He/PRP will/would/MD (not/RB) be/AUX sleeping/VBG.

d. He/PRP will/would/MD (not/RB) be/AUX having/doing/AUXG.

e. He/PRP will/would/MD (not/RB) have/AUX slept/VBN.

f. He/PRP will/would/MD (not/RB) have/AUX had/done/AUX it.

g. He/PRP would/MD (not/RB) have/AUX been/AUX sleeping/VBG.

h. He/PRP would/MD (not/RB) have/AUX been/AUX having/doing/AUXG this.

The following example sentences show the tenses which contain an auxiliary.

(67) a. He/PRP is/was/AUX (not/RB) sleeping/VBG.

b. He/PRP is/was/AUX (not/RB) having/doing/AUXG this.

c. He/PRP has/had/AUX (not/RB) slept/VBN.

d. He/PRP has/had/AUX (not/RB) been/AUX sleeping/VBG.

e. He/PRP has/had/AUX (not/RB) been/AUX having/doing/AUXG.

f. I/PRP am/AUX going/VBG to/TO sleep/VB.

g. I/PRP am/AUX going/VBG to/TO have/do/AUX this.

If, for example, English auxiliaries should be aligned differently depending on the VP that they belong to, the composition of the English VP has to be determined by examining its PoS sequence. If has/AUX (cf. example (67c)) should be aligned with the corresponding Italian auxiliary only if has/AUX is part of a composed VP, we would require the English VP to contain a participle. Thus, in addition to AUX, the PoS sequence of the English VP should also contain the PoS VBN.

A closer observation of the examples in (66) and (67) reveals that the PoS AUX is not only used for auxiliaries. The verbs am and do in (67g) have the same PoS although do should be considered here as a main verb.19 This causes a problem because different word categories are handled with different sets of alignment rules. If the word category is erroneous, false alignment rules can be applied. We will come back to this problem in section 5.4.4.

Italian

Again, we start with tenses which have only one verb. The optional subject pronoun and the negation are put in brackets.

19 [Charniak, 00] expands the Penn Treebank Tagset (listed in appendix B) with the tags AUX and AUXG, which are assigned to the auxiliaries.

(68) (Egli/PRO:pers)  (non/NEG)  dorme/dormivo/dormii/dormirò/VER:fin.
     He               not        sleeps/has slept/had slept/will sleep
     'He sleeps/has slept/had slept/will (not) sleep.'

(69) (Io/PRO:pers)  (non/NEG)  dormo/dormissi/dormirei/VER:fin.
     I              not        would sleep/would have slept/would sleep
     'I would sleep/would have slept/would (not) sleep.'

Italian composed tenses require an auxiliary or a modal verb, and a participle, an infinitive or a gerundive. The examples in (70) and (71) show simple sentences with the verb dormire (= to sleep), which takes the auxiliary avere (= to have). In Italian, there are also verbs like andare (= to go) and venire (= to come) which occur with the auxiliary essere (= to be). Since the PoS sequence is the same for tenses with both auxiliaries, I do not list example sentences with these verbs. The examples in (70) - (74) show all composed Italian tenses with all possible PoS sequences.

(70) (Egli/PRO:pers)  (non/NEG)  ha/avrà/AUX:fin  dormito/avuto/VER:ppast.
     He               (not)      has/will have    slept/had
     'He has/will have (not) slept.'

(71) (Io/PRO:pers)  (non/NEG)  abbia/avrei/AUX:fin   dormito/VER:ppast.
     I              (not)      would have/will have  slept
     'I would (not) have/would (not) have/will (not) have slept.'

(72) (Io/PRO:pers)  (non/NEG)  sto/AUX:fin  dormendo/VER:geru.
     I              (not)      am           sleeping
     'I am not sleeping.'

(73) (Io/PRO:pers)  (non/NEG)  posso/potrò/VER2:fin  dormire/VER:infi.
     I              (not)      can                   sleep
     'I can not sleep.'

(74) (Io/PRO:pers)  (non/NEG)  sto/stavo/VER:fin  facendo/VER:geru  ...
     I              not        am/was             doing
     'I am/was (not) doing ...'

Modal verbs subcategorize an infinitive as shown in (73). When these verbs are used in some of the tenses which are composed of an auxiliary and a participle, a different PoS sequence is generated (cf. example (75)).

(75) (Io/PRO:pers)  (non/NEG)  ho/avrei/AUX:fin  potuto/VER:ppast  constatare/VER:infi
     I              (not)      have              can               observe ...
     'I would (not) have/would (not) have/will (not) have observe ...'

Whereas some tenses in passive voice (cf. the example in (76)) do not differ from the composed tenses shown in (74) regarding their PoS sequence, some past tenses in passive voice require two forms of an auxiliary, as shown in (77).

(76) (Egli/PRO:pers)  (non/NEG)  è/sarà/AUX:fin  amato/VER:ppast.
     He               (not)      is/was/will be  loved
     'He is (not)/was (not)/will (not) be loved.'

(77) (Egli/PRO:pers)  (non/NEG)  è/era/AUX:fin  stato/AUX:ppast  amato/VER:ppast.
     He               (not)      is/will be     been             loved
     'He has (not) been/will (not) have been loved.'

There is one construction in Italian which is often used to abbreviate a finite sentence. It consists of a gerundive and, if the verb is modal, of an infinitive. Since these constructions can be used as translations of English finite clauses, they should also be taken into account.

(78) (Non/NEG)  ribadendo/VER:geru  ...
     (not)      stressing           ...
     '(Not) stressing ... / I (do not) stress ...'

(79) (Non/NEG)  volendo/VER:geru  affrontare/VER:infi  ...
     not        wishing           confront             ...
     '(Not) wishing to confront ... / I (do not) wish to confront ...'

Let us now consider the parallel sentences (67c) and (70). We have already defined the English context constraints that have to be fulfilled if has/AUX should be aligned with the Italian auxiliary, in our example with ha/AUX:fin. If we would like to allow the alignment of the English finite auxiliary with any Italian finite verb form, the condition on Italian is that the Italian verb (here, the auxiliary ha/AUX:fin) is finite, i.e. its PoS must contain "fin" (e.g. AUX:fin, VER:fin, VER2:fin).

In the following sections, the alignment rules for different word categories are presented.

5.4.2 Subject pronouns

Since the pronominal subject in English does not have to have a pronominal counterpart in the Italian parallel sentence, the alignment of the subject pronoun is often not correct. Figure 6 shows an example of an incorrect base word alignment.20

[Figure: alignment diagram with the links if9/IN ↔ se8/CON, you10/PRP ↔ lo9/CLI, wish11/VBP ↔ desidera10/VER:fin]

Figure 6: Incorrect base alignment of if you wish and se lo desidera

20 The subscripts in the alignment figures mark the word position in a sentence.

The phrases in figure 6 are taken from the sentences which are shown in (80) and (81).

(80) That is precisely the time when you may, if you wish, raise this question, ...

(81) È   appunto  in  quell'  occasione  che,   se  lo   desidera,  avrà       modo    di  sollevare  la   sua   questione  ...
     is  exactly  in  this    occasion   that,  if  you  wish,      will have  chance  to  raise      the  your  question   ...

Figure 6 shows an alignment of the embedded clauses if you wish and se lo desidera. The English subject pronoun is aligned with the Italian object clitic lo, whereas the predicate wish is correctly aligned with the Italian predicate desidera. In this case, it would be correct to align the English pronoun with the Italian finite verb since you wish should be translated as desidera.21 The correct alignment is shown in figure 7.

[Figure: alignment diagram with the links if9/IN ↔ se8/CON, you10/PRP ↔ desidera10/VER:fin, wish11/VBP ↔ desidera10/VER:fin; lo9/CLI is unaligned]

Figure 7: Correct alignment of if you wish and se lo desidera

The correctness of the alignment in figure 7 is linguistically motivated. The information provided by the English subject pronoun you and the finite verb wish is the same as the information provided by the Italian finite verb desidera regarding person and number of the subject. This is why both English words should be aligned with the Italian verb. In general, the English pronoun should be aligned with the Italian finite verb that corresponds to the English finite verb. The subject alignment rule leads to the link between the English word with PoS PRP (English subject pronoun) and the Italian word with PoS VER:fin (finite verb), VER2:fin (finite modal verb), or AUX:fin (finite form of an auxiliary). This rule is summarized in (82a).

In figure 7, the alignment rule (82a) leads to the deletion of the base alignment link between the English pronoun you and the Italian object clitic lo. Since lo does not have further base alignment links, it remains unaligned in the given sentence pair.

Yet, the Italian subject pronoun is not always omitted. If it is expressed overtly, the English subject pronoun should be aligned only with it. The pronouns bear the same information about number, person and gender of the subject. In such a context, the Italian verb is not needed to derive these characteristics of the English subject pronoun, so I do not align it with the Italian finite verb. This rule is presented in (82b). The rule associates the English pronoun (PRP) with the Italian pronouns (PRO:pers - personal pronoun, PRO:demo - demonstrative pronoun).

The English finite clause can also be translated as a gerundive construction in Italian. The gerundive bears the semantics that corresponds to the semantics of the English predicate (for example, I/PRP think/VBP ↔ pensando/VER:geru). In such constructions, the aim is to align the English pronoun and finite verb with the same Italian verb (here, the gerundive). This rule is expressed in (82c). When the Italian VP is an infinitive construction (for example, I/PRP have/AUX thought/VBN ↔ aver/AUX:infi pensato/VER:ppast), the English subject pronoun should be aligned with the infinitive form of the Italian auxiliary. Thus, the alignment between the pronoun (PRP) and the infinitival auxiliary (AUX:infi) has to be allowed.

Another possible infinitival construction in Italian consists of a preposition (PRE) and an infinitive (*:infi:*), for example I believe, I/PRP know/VBP this ↔ Credo di/PRE saperlo/VER:infi:cli. The pronoun I should be aligned with the Italian preposition di to satisfy the condition of being aligned with the same word as its finite verb.22 This is expressed in the rule (82d).

21 The Italian finite verb is 3rd person singular, so this should be understood as a polite form of address where the addressee is one person.

(82) a. EN: subject pronoun → IT: finite verb
        if IT does not have a subject pronoun
        EN: PRP → IT: {VER:fin, VER2:fin, AUX:fin}

     b. EN: subject pronoun → IT: subject pronoun
        if IT has a subject pronoun
        EN: PRP → IT: {PRO:pers, PRO:demo}

     c. EN: subject pronoun → IT: gerundive
        if IT is a gerundive construction
        EN: PRP → IT: {VER:geru, VER2:geru}

     d. EN: subject pronoun → IT: infinitival particle or infinitive auxiliary
        if IT is an infinitive construction
        EN: PRP → IT: {PRE, AUX:infi}
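The subject rules in (82) can be sketched as a cascade of PoS checks on the Italian VP. This is an illustration rather than the align_subj function of this work: the representation of the Italian VP as (position, PoS) pairs is an assumption, and so is the order in which the conditions are tested (overt pronoun first, then finite verb, then gerundive, then infinitive construction).

```python
def align_subject(it_vp):
    """Return the Italian positions an English PRP should be aligned with.

    it_vp: list of (position, PoS) pairs of the matching Italian VP.
    The precedence of the rules is an assumption of this sketch.
    """
    def positions(allowed):
        return [p for p, pos in it_vp if pos in allowed]

    pronouns = positions({"PRO:pers", "PRO:demo"})
    if pronouns:                                   # rule (82b)
        return pronouns
    finite = positions({"VER:fin", "VER2:fin", "AUX:fin"})
    if finite:                                     # rule (82a)
        return finite
    gerundive = positions({"VER:geru", "VER2:geru"})
    if gerundive:                                  # rule (82c)
        return gerundive
    return positions({"PRE", "AUX:infi"})          # rule (82d)


# "posso risponderle": no overt subject pronoun, so the finite verb is chosen
links = align_subject([(4, "VER2:fin"), (5, "VER:infi:cli")])
```

If none of the listed PoS occurs in the Italian VP, the sketch returns an empty list, i.e. the English pronoun stays unaligned.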

Figure 8 is an example of the alignment rule (82a).23 The rule (82a) leads to the alignment of the English subject pronoun I with the Italian finite verb posso (VER2:fin).

[Figure: alignment diagram of I4/PRP can5/MD tell6/VB you7/PRP and posso4/VER2:fin risponderle5/VER:infi:cli with the link I4/PRP ↔ posso4/VER2:fin]

Figure 8: Alignment of I can tell you and posso risponderle

Figure 9 shows an example of the alignment rule (82b). The English personal subject pronoun it is only aligned with the Italian pronominal subject esso. In figure 10, a phrase pair is shown to which the alignment rule (82c) can be applied. It allows the alignment of an English pronoun with an Italian gerundive. Figure 11 shows an example of the alignment rule (82d), which allows the alignment of an English pronoun with an Italian infinitive auxiliary.

22 Finite verb alignment rules are discussed in the next section.
23 The alignments marked with dotted lines are not of interest at this point.

[Figure: alignment diagram of it20/PRP actually21/RB passes22/VBZ and esso19/PRO:pers stesso20/ADJ approva21/VER:fin with the link it20/PRP ↔ esso19/PRO:pers]

Figure 9: Alignment of it actually passes and esso stesso approva

[Figure: alignment diagram of I0/PRP would1/MD say2/VB and volendo0/VER2:geru dire1/VER:infi with the link I0/PRP ↔ volendo0/VER2:geru]

Figure 10: Alignment of I would say and volendo dire

[Figure: alignment diagram of I0/PRP have1/AUX said2/VBN and aver0/AUX:infi detto1/VER:ppast with the link I0/PRP ↔ aver0/AUX:infi]

Figure 11: Alignment of I have said and aver detto

5.4.3 Finite verbs

After dealing with English subject pronouns, we now examine the verbal elements of English sentences containing a subject pronoun. Let us first examine the example base alignment presented in figure 12.

[Figure: alignment diagram with the links I5/PRP ↔ ritengo6/VER:fin, feel6/VBP ↔ ritengo6/VER:fin, feel6/VBP ↔ che7/CHE]

Figure 12: Incorrect base alignment of I feel and ritengo

The sentences which contain the phrases in figure 12 are shown in (83) and (84).

(83) Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate.

(84) Sì,   Onorevole  Evans,  ritengo  che   un'  iniziativa  del     tipo  che   lei  propone  sia       assolutamente  opportuna.
     yes,  mr         evans,  believe  that  a    initiative  of the  type  that  you  suggest  would be  absolutely     appropriate.

The English finite verb feel should only be aligned with the Italian finite verb form ritengo. The motivation for this assumption is that their semantic features are similar. They have the same word category and share the same verbal features (tense, finiteness, person, number).24 Following this idea, the correct alignment of the phrases in figure 12 is presented in figure 13. The base alignment link between feel and che is removed. The English verb is only aligned with the Italian verb.25

[Figure: alignment diagram with the links I5/PRP ↔ ritengo6/VER:fin, feel6/VBP ↔ ritengo6/VER:fin; che7/CHE is unaligned]

Figure 13: Correct base alignment of I feel and ritengo

If parallel sentences both contain a finite VP, the finite verbs in both languages should be aligned with each other. This means that English words with PoS VBZ, VBD, VBP, AUX, MD are aligned with the Italian words with PoS VER:fin, VER2:fin, AUX:fin. This is stated in rule (85a).

The English finite verbs can also be auxiliaries (AUX) or modals (MD). I refer to both types of verbs as auxiliaries. If an English (finite) auxiliary is to be aligned, it should be aligned with the Italian (finite) auxiliary (or auxiliaries). If we have VPs that differ in voice (active vs. passive), the English finite verb or auxiliaries should be aligned with the Italian finite auxiliaries or their participles (cf. rule (85b)).

If the English VP consists of only one verb whereas the Italian VP is composed, the English finite verb should be aligned with all Italian verbs. For example, if we examine the parallel VPs you/PRP said/VBD ↔ abbiate/AUX:fin detto/VER:ppast, we see that the English finite verb said bears the same verb features as the Italian composed VP [abbiate detto]VP. They both express a past action. Thus, we would like to translate the English past tense (in the example said) into the corresponding past tense in Italian, which consists not only of the participle detto, but also of the auxiliary abbiate. So, the English verb should be aligned with both Italian verbs (verb alignment rules (85a) and (85c)). This alignment rule should lead to an alignment between an English word with PoS VBZ, VBD, VBP, AUX, MD and an Italian participle or infinitive with PoS AUX:ppast, VER:ppast, VER2:ppast, VER:infi, VER2:infi. Furthermore, the combination of the rules (85a) and (85c) satisfies the condition that the English pronoun and its finite verb should both be aligned with the same Italian finite verb (if the subject pronoun does not exist in Italian).

If the Italian parallel VP is a gerundive or an infinitive construction consisting of a preposition, the English finite verb should be aligned with the Italian gerundive (cf. rule (85e)) or with the Italian preposition (cf. rule (85d)), respectively.

Finite verb alignment rules are summarized in (85).

24 Although parallel verbs sometimes have different verbal features, they should be aligned, satisfying the condition that the same word categories should be associated with each other.

25 In this work, only the alignment of verbal sentence elements has been modified. Defining an alignment of subcategorized conjunctions is beyond the scope of this thesis. Furthermore, the removal of the link between feel and che in figure 13 still allows the extraction of the translation phrase pairs (I feel ↔ ritengo) and (I feel ↔ ritengo che).

(85) a. EN: finite verb → IT: finite verb
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:fin, VER2:fin, AUX:fin}

     b. EN: finite verb → IT: participle form of auxiliary,
        if IT VP is in passive voice
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {AUX:ppast}

     c. EN: finite verb → IT: participle or infinitive,
        if EN VP is not composed
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi}

     d. EN: finite verb → IT: infinitival particle,
        if IT is an infinitive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {PRE}

     e. EN: finite verb → IT: gerundive,
        if IT is a gerundive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:geru, VER2:geru}
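The rules in (85) can be read as a lookup from the PoS of an English finite verb to a set of admissible Italian PoS tags, widened by the conditions on the two VPs. The following is a minimal sketch of that reading, not the implementation used in this work. The function name, the two boolean flags and the way an infinitive or gerundive construction is detected (a PRE tag together with an infinitive, or the presence of a gerundive tag) are assumptions made for illustration.

```python
# Hypothetical sketch of the finite verb rules (85); names are illustrative only.
EN_FINITE = {"VBZ", "VBD", "VBP", "AUX", "MD"}
IT_FINITE = {"VER:fin", "VER2:fin", "AUX:fin"}
IT_PART_OR_INF = {"VER:ppast", "VER2:ppast", "VER:infi", "VER2:infi"}
IT_INF = {"VER:infi", "VER2:infi", "AUX:infi"}
IT_GERUND = {"VER:geru", "VER2:geru"}


def finite_verb_targets(en_pos, it_vp, *, en_composed, it_passive):
    """Return the Italian words an English finite verb may be linked to.

    it_vp is a list of (word, pos) pairs of the matching Italian VP.
    en_composed and it_passive are assumed to be supplied by the caller.
    """
    if en_pos not in EN_FINITE:
        return []
    it_tags = {pos for _, pos in it_vp}
    allowed = set(IT_FINITE)                      # (85a)
    if it_passive:
        allowed.add("AUX:ppast")                  # (85b)
    if not en_composed:
        allowed |= IT_PART_OR_INF                 # (85c)
    if "PRE" in it_tags and it_tags & IT_INF:     # assumed test for (85d)
        allowed.add("PRE")
    if it_tags & IT_GERUND:                       # assumed test for (85e)
        allowed |= IT_GERUND
    return [word for word, pos in it_vp if pos in allowed]


if __name__ == "__main__":
    # "you enjoyed" / "abbiate trascorso": the English VP is not composed
    print(finite_verb_targets(
        "VBD",
        [("abbiate", "AUX:fin"), ("trascorso", "VER:ppast")],
        en_composed=False, it_passive=False))
```

Passing the conditions as flags keeps the sketch independent of how composed VPs and passive voice are actually recognised, which is not specified in this section.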

Figure 14 shows an alignment of the English finite verb enjoyed after applying alignment rules (85a) and (85c). The link between the English finite verb and the Italian participle should only be possible if the English VP is not composed and the Italian VP consists of an auxiliary and a participle or infinitive.

EN:  you_33/PRP  enjoyed_34/VBD
IT:  abbiate_23/AUX:fin  trascorso_24/VER:ppast
Links: enjoyed ↔ abbiate, enjoyed ↔ trascorso

Figure 14: Alignment of you enjoyed and abbiate trascorso

If the English VP is composed, and thus the condition for applying the rule (85c) is not fulfilled, only the alignment rule (85a) can be applied, resulting in the alignment shown in figure 15.

EN:  you_0/PRP  have_1/AUX  requested_2/VBN
IT:  avete_0/AUX:fin  chiesto_1/VER:ppast
Links: have ↔ avete

Figure 15: Alignment of you have requested and avete chiesto


Figure 16 shows an example for alignment rules (85a) and (85b). The English finite verb were is aligned with two Italian words: the finite verb siamo and the second auxiliary stati, which is a participle. As already mentioned, English auxiliaries should be aligned with Italian auxiliaries.

EN:  we_13/PRP  were_14/AUX  elected_15/VBN
IT:  siamo_17/AUX:fin  stati_18/AUX:ppast  eletti_19/VER:ppast
Links: were ↔ siamo, were ↔ stati

Figure 16: Alignment of we were elected and siamo stati eletti

Figure 17 shows parallel VPs of a different type. Whereas English has a finite subclause with the predicate had, Italian uses the infinitival construction di avere. The sentences from which the phrases in figure 17 are extracted are shown in (86) and (87).

(86) ... that everybody would make certain that they had adequate ...

(87) ...  che   tutti  si accertino  di avere  una  formazione  adeguata  ...
     ...  that  all    ensure        to have   a    education   adequate  ...

Alignment rule (85d) allows the English finite verb had to be aligned with the Italian infinitival particle di. Combining this rule with rules (82d) for pronouns and (85c) for finite verbs, the alignment shown in figure 17 is computed.

EN:  they_17/PRP  had_18/AUX
IT:  di_2/PRE  avere_3/VER:infi
Links: they ↔ di, had ↔ di, had ↔ avere

Figure 17: Complete alignment of they had and di avere

5.4.4 Participles, infinitives and gerundives

The infinitive, participle or gerundive form of a verb is part of a VP if the VP is composed. Auxiliaries are used to build some tenses, but the meaning of a VP is provided by the infinitive or participle form of the main verb. For this reason, the alignment rules for English infinitives, participles and gerundives should lead to an alignment between these forms and Italian infinitives, participles and gerundives. This is stated in rule (88b). An example is shown in figure 18.

EN:  you_0/PRP  have_1/AUX  requested_2/VBN
IT:  avete_0/AUX:fin  chiesto_1/VER:ppast
Links: requested ↔ chiesto

Figure 18: Alignment of you have requested and avete chiesto

The alignment between English and Italian participles, infinitives and gerundives is possible only if the Italian VP is composed, which is not necessarily the case. Italian may use a tense that does not require an auxiliary, so that all English verbs, including the participle, should be aligned with the Italian finite verb, as shown in figure 19. The same holds for infinitives which occur with modal verbs. The English participle should be aligned with the Italian verb which has the same or similar semantic features. The Italian verb form should be aligned with the English auxiliary and the main verb in order to express the same tense. This leads to the definition of the rule in (88a).

EN:  you/PRP  have/AUX  requested/VBN
IT:  chiedevate/AUX:fin
Links: requested ↔ chiedevate

Figure 19: Alignment of you have requested and chiedevate

The rules handling these cases are summarized in (88). If we take a closer look at rule (88a), we see that in some cases the information provided by the PoS is not enough to apply the correct alignment rule. For example, the verb been has the same PoS (AUX) no matter whether it is used as an auxiliary or as a main verb. When computing the alignment for been, we have to decide which of the two uses is given. If, for example, a composed Italian VP is given, been as an auxiliary (for example in they have been said) should be aligned with the Italian auxiliary. If it is a main verb, it should be aligned with the Italian main verb.26

(88) a. EN: participle, infinitive or gerundive → IT: finite verb,
        if IT VP is not composed, or the English verb is like subcategorizing a to-infinitive, or the English verb is be
        EN: {VBN, VB, VBG, TO} → IT: {VER:fin, VER2:fin, AUX:fin}

     b. EN: participle, infinitive or gerundive → IT: participle, infinitive or gerundive,
        if IT VP is composed
        EN: {VBN, VB, VBG, AUXG, TO} → IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi, AUX:infi}

     c. EN: participle → IT: participle of an auxiliary,
        if EN VP is not composed and IT is not in passive voice
        EN: {VBN} → IT: {AUX:ppast}

     d. EN: participle → IT: infinitival particle,
        if IT is an infinitival construction
        EN: {VBN} → IT: {PRE}

26 The Italian verb form stata has two different PoS tags depending on the context in which it occurs. If it is part of a passive construction, it has the PoS AUX:ppast; otherwise it is tagged as VER:ppast.

The English infinitive like is another verb which I treat as an auxiliary, but only if it occurs with a to-infinitive, for example I would like to say, as shown in figure 20. If like is part of a construction containing a modal verb (MD) and a to-infinitive (TO + VB), it should be treated as an auxiliary, i.e. as a finite verb. This ensures that it is aligned with the same Italian finite verb as the modal (here, would).

EN:  I/PRP  would/MD  like/VB  to/TO  say/VB
IT:  vorrei/VER:fin  dire/VER:fin
Links: like ↔ vorrei

Figure 20: Alignment of I would like to say and vorrei dire

5.4.5 Negation

In this work, the English negation particle not is treated as a part of the VP, and its alignment should also be taken into account. The simplest case is to associate the English negation with the Italian negation, as shown in figure 21.

EN:  we_5/PRP  do_9/AUX  not_10/RB  adhere_11/VB
IT:  noi_6/PRO:pers  non_7/NEG  rispettiamo_8/VER:infi
Links: not ↔ non

Figure 21: Alignment of we do not adhere and noi non rispettiamo

But this is not always possible. Since the sentences in the parallel corpus are not always one-to-one translations of each other, the negation may exist in only one of the two languages. It is also possible that the verb in one language already contains the negation (for example as an attached prefix) whereas its counterpart does not and therefore requires an explicit negation.

(89) EN: negation → IT: negation,
     if IT VP contains a negation
     EN: {RB} → IT: {NEG}

The negation alignment rule in (89) allows for English negation only to be aligned tothe Italian negation particle. If there are some mismatches, English negation staysunaligned.
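Rule (89) can be sketched as follows. This is an illustrative reading only: the function name is hypothetical, and since the tag RB also covers ordinary adverbs, the sketch additionally assumes that the negation is recognised by its word form not.

```python
def negation_links(en_vp, it_vp):
    """Sketch of rule (89): link English 'not' (RB) to Italian NEG tokens.

    Both VPs are lists of (word, pos) pairs. The result is a list of
    (english_index, italian_index) pairs; it is empty if the Italian VP
    contains no negation, i.e. the English negation stays unaligned.
    """
    it_negations = [j for j, (_, pos) in enumerate(it_vp) if pos == "NEG"]
    links = []
    for i, (word, pos) in enumerate(en_vp):
        # Assumption: RB alone is too broad, so the word form is checked too.
        if pos == "RB" and word.lower() == "not":
            links.extend((i, j) for j in it_negations)
    return links


if __name__ == "__main__":
    en = [("we", "PRP"), ("do", "AUX"), ("not", "RB"), ("adhere", "VB")]
    it = [("noi", "PRO:pers"), ("non", "NEG"), ("rispettiamo", "VER:infi")]
    print(negation_links(en, it))
```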


5.4.6 Infinitival particle

Since the English to-infinitives which are considered to be subcategorised by the verbs are also handled, we need alignment rules for the English infinitival particle to. The rule is simple: it should be aligned with the Italian infinitival particle (PRE) if the Italian VP is an infinitival construction (PRE + *:infi or simply *:infi). If this condition is not met, it should be handled as an infinitive.

(90) a. EN: infinitival particle → IT: infinitival particle,
        if IT is an infinitival construction
        EN: {TO} → IT: {PRE}

     b. EN: infinitival particle ↔ EN: infinitive,
        if IT is not an infinitival construction
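The fallback in (90b) means that to is passed on to the rules for English infinitives in (88). A minimal sketch of this two-step decision is given below; the function name is hypothetical, and a composed Italian VP is approximated here by the presence of a participle or infinitive, which is an assumption of the sketch.

```python
IT_NONFINITE = {"VER:ppast", "VER2:ppast", "VER:infi", "VER2:infi", "AUX:infi"}
IT_FINITE = {"VER:fin", "VER2:fin", "AUX:fin"}


def particle_targets(it_vp):
    """Return the Italian words the English particle 'to' may be linked to.

    it_vp is a list of (word, pos) pairs of the matching Italian VP.
    """
    prepositions = [word for word, pos in it_vp if pos == "PRE"]
    if prepositions:
        return prepositions                       # (90a)
    # (90b): handle 'to' like an English infinitive, i.e. via (88).
    nonfinite = [word for word, pos in it_vp if pos in IT_NONFINITE]
    if nonfinite:                                 # composed IT VP, rule (88b)
        return nonfinite
    return [word for word, pos in it_vp if pos in IT_FINITE]   # rule (88a)


if __name__ == "__main__":
    print(particle_targets([("raccomando", "VER:fin"), ("di", "PRE"),
                            ("presentare", "VER:infi")]))
    print(particle_targets([("verrà", "AUX:fin"), ("messo", "VER:ppast")]))
```

The two calls correspond to the VP pairs I suggest to present / raccomando di presentare and he is to go / verrà messo discussed in this chapter.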

Figure 22 shows an example for the rule (90a). Besides this alignment, the figure also shows the alignments of the other English tokens, computed by applying the rules (82a) for I, (85a) for suggest, and (88b) for present.

EN:  I/PRP  suggest/VBP  to/TO  present/VB
IT:  raccomando/VER:fin  di/PRE  presentare/VER:infi
Links: I ↔ raccomando, suggest ↔ raccomando, to ↔ di, present ↔ presentare

Figure 22: Alignment of I suggest to present and raccomando di presentare

5.4.7 Alignment examples

In the following, a few examples of computed VP alignment are presented. The sentencesare taken from Europarl.

EN:  I_13/PRP  shall_14/MD  do_15/AUX
IT:  seguirò_8/VER:fin
Links: I ↔ seguirò, shall ↔ seguirò, do ↔ seguirò

Figure 23: Alignment of I shall do and seguirò

The phrases in figure 23 are simple to align. Since there is only one verb in Italian, all English words are aligned with it. To achieve an alignment between the English subject pronoun I and the Italian finite verb seguirò, the alignment rule (82a) has to be applied. The modal shall, which is recognised as a finite verb, is aligned with seguirò according to the alignment rule (85a). Finally, the link between the auxiliary do, which represents the main verb of the given VP, and seguirò is also computed by the rule (85a). This example also shows that the same tense (future tense) is formed differently in the two languages.


Whereas English needs a modal verb and an infinitive, the Italian verb takes a suffix to express the future tense.

EN:  we_9/PRP  have_10/AUX  upheld_12/VBN
IT:  abbiamo_9/VER:fin  sostenuto_11/VER:ppast
Links: we ↔ abbiamo, have ↔ abbiamo, upheld ↔ sostenuto

Figure 24: Alignment of we have upheld and abbiamo sostenuto

Figure 24 shows the alignment of composed VPs. The English personal pronoun and finite auxiliary are aligned with the Italian finite verb abbiamo. The subject pronoun is aligned according to the alignment rule (82a), whereas the English auxiliary is aligned to the same verb according to the rule (85a). The participles are aligned with each other, which is determined by the rule (88b).

Figure 25 shows an example of a VP pair in which the Italian VP contains a subject pronoun.

EN:  you_12/PRP  have_13/AUX  suggested_14/VBN
IT:  lei_13/PRO:pers  propone_14/VER:fin
Links: you ↔ lei, have ↔ propone, suggested ↔ propone

Figure 25: Alignment of you have suggested and lei propone (= you propose)

In this context, the English subject pronoun should only be aligned with the Italian subject pronoun. This is stated in the alignment rule (82b). The English finite verb is aligned with the Italian finite verb according to rule (85a), whereas the alignment of the English participle (as a main verb) is defined by the rule (88a).

This example pair shows another discrepancy that I noticed when examining the identified phrase pairs. Often, the VPs are not in the same tense. In figure 25, the English VP is in the past tense whereas the corresponding Italian VP denotes an action in the present.

The VP pair in figure 26 shows the case in which the English subcategorized to-infinitive should be aligned with the Italian participle as a main verb (not subcategorized).27 The phrases are extracted from the sentences in (91) and (92).

EN:  he_4/PRP  is_5/AUX  to_6/TO  go_7/VB
IT:  verrà_4/AUX:fin  messo_5/VER:ppast
Links: he ↔ verrà, is ↔ verrà, to ↔ messo, go ↔ messo

Figure 26: Alignment of he is to go and verrà messo

27 The phrases are rather idiomatic. In Europarl, I found only 39 sentences containing the English VP, whereas the Italian VP occurs in only 16 sentences.


(91) Now, however, he is to go before the courts once more because the public prosecutor is appealing.

(92) Ora,  però,  verrà messo    nuovamente  in  stato     di  accusa
     now,  but,   will come put  again       in  position  of  accusation

     perché   il   pubblico  ministero   ricorrerà  in   appello.
     because  the  public    government  recurs     the  appeal.

Again, the English pronoun and finite verb are aligned with the Italian finite verb verrà according to the rules (82a) and (85a). The infinitive particle to is treated as an English participle, infinitive or gerundive, since the Italian VP does not contain a preposition (as an infinitive particle) which would be an alignment candidate for English to. As such, the infinitive particle, as well as the English infinitive go, is aligned with the Italian participle. The rule applied to compute this alignment is rule (88b).

An Italian VP corresponding to an English finite VP can also consist of only one verb, which is not necessarily finite. Figure 27 shows an Italian VP consisting of only one gerundive. The link between the English pronoun and the Italian gerundive is produced by applying the alignment rule (82c). In the given context, the English finite verb is aligned according to the rule (85e).

EN:  you_4/PRP  hear_5/VBP
IT:  ascoltando_4/VER:geru
Links: you ↔ ascoltando, hear ↔ ascoltando

Figure 27: Alignment of you hear and ascoltando

5.5 Evaluation

In this section, the VP alignment computed on the basis of the rules which take the PoS of the words into account is evaluated. Precision, recall and f-score are computed for the base alignment and the rule-based VP alignment. After comparing the results, errors made by the rule-based VP alignment are shown and discussed. Furthermore, some examples of syntactic divergences between English and Italian that are problematic for the system are shown.

After the evaluation of the improved alignment, translation systems are built and tested. BLEU scores are reported and the translations of example sentences with pronominal subjects are discussed.

5.5.1 Precision, Recall, F-score

The program for computing the word alignment of English and Italian VPs has been applied to 200 parallel sentences randomly chosen from Europarl (cf. section 5.2). The sentences contain both NP and pronominal subjects.


The program for the alignment improvement produces a set of partial alignments, containing alignments only for the identified pronominal subjects and their VPs. The alignments of other words are not part of the output of the program.

I manually annotated the alignment of the English pronominal subjects and VPs in the test set with their Italian counterparts. The manual annotation of the test set provided the partial gold alignment G containing 563 gold alignment links. The alignments of the English words outside of the phrases (pronominal subject + VP) that were of interest for this work were ignored (they are simply not word aligned in the hypothesis and the gold alignment).

To evaluate the base alignment of English pronominal subjects and VPs, it was necessary to extract the alignments of the relevant words from the complete word alignment of a sentence pair. This was done on the basis of the word positions of the aligned English words in the gold alignment. The extracted base word alignment contains all links for the elements of English VPs which are annotated in the gold alignment. So, if there are links to Italian words which are not part of the matching Italian VPs, they have a negative impact on precision.

The alignment that is tested is called the hypothesis H. Given the gold alignment and the hypothesis, the evaluation method based on precision, recall and f-score can be applied. Precision is a measure of the correctness of the hypothesis and is calculated as shown in equation (14).

$$P = \frac{|H \cap G|}{|H|} \tag{14}$$

Recall is the proportion of gold alignment links that are found by the hypothesis (cf. equation (15)).

$$R = \frac{|H \cap G|}{|G|} \tag{15}$$

F-score is the harmonic mean of precision and recall, and it is computed as shown in (16).

$$F = \frac{2PR}{P + R} \tag{16}$$
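Equations (14) to (16), together with the restriction of the base alignment to the English positions annotated in the gold alignment, can be sketched as follows. The function names and the toy link sets are hypothetical and serve only as an illustration; links are assumed to be (English position, Italian position) pairs.

```python
def restrict_to_gold_positions(links, gold):
    """Keep only links whose English position is annotated in the gold alignment."""
    gold_en_positions = {en for en, _ in gold}
    return {(en, it) for en, it in links if en in gold_en_positions}


def evaluate(hypothesis, gold):
    """Return (precision, recall, f_score) of a hypothesis against the gold links."""
    hypothesis, gold = set(hypothesis), set(gold)
    correct = len(hypothesis & gold)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, not taken from the test set.
    gold = {(0, 1), (1, 1), (2, 2)}
    base = {(0, 0), (1, 1), (2, 2), (5, 6)}   # (5, 6) lies outside the VP
    hypothesis = restrict_to_gold_positions(base, gold)
    print(evaluate(hypothesis, gold))
```

In the toy data, the link (0, 0) stays in the hypothesis because its English word is annotated in the gold alignment, so it lowers precision, whereas (5, 6) is ignored.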

The evaluation results are shown in table 6.

Alignment     # alignments   Precision   Recall   F-score
Base          522            0.66        0.61     0.64
Rule-based    572            0.80        0.81     0.81

Table 6: Evaluation of the VP alignment

In all measures, the rule-based VP alignment is better than the base alignment. Measured by f-score, the rule-based VP alignment improves on the base alignment by 17 percentage points (0.81 vs. 0.64). In a large number of sentences, the base alignment of VPs is both incomplete and incorrect. Since the method described in this work identifies entire VPs, all VP elements are examined and aligned, producing a VP alignment which contains links between all VP elements. The alignment rules allow only alignments between elements which share some characteristics (word category, number, etc.), so that alignments to other word categories, which are incorrect, are excluded.28

In the following, examples are presented in which the rule-based VP alignment improves on the base word alignment. We start with an example that shows the most frequent correction of the base alignment. As already mentioned in section 5.1, the English subject pronoun is often aligned with various Italian words because its direct counterpart is missing. These include, among others, the Italian object clitics. Italian syntax allows the object clitic to occur in front of the finite verb. Therefore, it is often aligned with the English subject pronoun, which is always situated in front of the verbal sentence predicate (cf. figure 28). The base alignment links are marked with wavy lines, whereas the rule-based alignment links are displayed as straight lines. Overlapping alignments are marked as a combination of a wavy and a straight line.

EN:  I_13/PRP  accept_14/VBP
IT:  la_15/CLI  accetto_16/VER:fin
Base links: I ↔ la, accept ↔ accetto
Rule-based links: I ↔ accetto, accept ↔ accetto

Figure 28: Alignment comparison: I accept and lo accetto

In figure 28, the alignment rules for subject pronouns lead to the alignment of the English subject pronoun I with the Italian finite verb accetto (straight line), whereas the link between the pronoun and the Italian object clitic is deleted (wavy line). Both alignments contain the link between the main verbs accept and accetto.

The example in figure 29 shows the advantage of using English parse trees. The English VP is interrupted by an embedded sentence, but deriving the English VP from the parse tree leads to the extraction of the complete VP, which is then correctly aligned with its Italian counterpart. The base alignment does not produce any alignments for the beginning of the English VP, namely the word sequence it will.

EN:  it_0/PRP  will_1/MD  ...  be_6/AUX  examined_7/VBN
IT:  sarà_2/AUX:fin  esaminata_3/VER:ppast

Figure 29: Alignment comparison: it will (, I hope,) be examined and sarà esaminata

28 For now, we assume that VP elements can only be aligned to VP elements, i.e. verbs, negation and subject pronouns, which leads to higher precision. However, the assumption is not completely correct, which has a negative impact on recall. This will be discussed in the following chapter.


The following figure shows the case in which the base alignment links the English subject pronoun to the Italian adverb pertanto. The rule-based VP alignment deletes this link and creates the alignment between the pronoun and the Italian finite verb form può, which corresponds to the English modal can. Furthermore, the link between the English main verb give and the Italian preposition su is removed.29

EN:  I_12/PRP  can_13/MOD  ...  give_17/VBN
IT:  pertanto_13/ADV  può_14/VER2:fin  contare_15/VER:infi  su_16/PRE

Figure 30: Alignment comparison: I can (, therefore,) give and pertanto può contare su

In figure 31, there is an example of an Italian infinitival clause corresponding to an English finite clause. The contexts of the VPs are shown in (93) and (94).

(93) ... I would ask you to request that the commission express its opinion on thisissue and that we then proceed to the vote.

(94) ...  la   prego  di  chiedere  alla  commissione  di  esprimersi
     ...  you  I ask  to  request   the   commission   to  express itself

     subito  e    poi         di procedere  al      voto.
     soon    and  afterwards  to proceed    to the  vote.

Again, in the base alignment the English pronoun is aligned with the Italian adverb, whereas the English verb is aligned only with the Italian main verb. The alignment rules correct the alignment of the English subject pronoun and align it with the Italian preposition di, which is considered to be part of the Italian VP. Since the intention was to align the English subject pronoun and its finite verb with the same Italian word, the English finite verb proceed is also aligned with di. Additionally, it is aligned with its Italian counterpart procedere.

EN:  we_12/PRP  ...  proceed_17/VBN
IT:  poi_13/ADV  di_14/PRE  procedere_15/VER:infi
Base links: we ↔ poi, proceed ↔ procedere
Rule-based links: we ↔ di, proceed ↔ di, proceed ↔ procedere

Figure 31: Alignment comparison: we (then) proceed and poi di procedere

Figure 32 shows one of the common base alignments for the English subject pronoun: it is often aligned with sentence punctuation, in our case with a comma. The VP alignment rules remove this link and lead to an alignment in which the English subject pronoun and its finite verb have are both aligned with the same Italian word.

29Cf. footnote 25 in section 5.4.3.


EN:  I_0/PRP  have_1/AUX  ...  proposed_3/VBN
IT:  ,_2/PUN  ho_3/AUX:fin  proposto_4/VER:ppast
Base links: I ↔ ",", have ↔ ho, proposed ↔ proposto
Rule-based links: I ↔ ho, have ↔ ho, proposed ↔ proposto

Figure 32: Alignment comparison: I have (thus) proposed and , ho proposto

In figure 33, VPs including a negation are shown. In the base alignment, the English negation does not have any alignments, whereas the rule-based VP alignment links it to its Italian counterpart non. If the English VP contains a negation and an auxiliary which is needed to negate the verbal predicate (here reflect), the auxiliary is aligned solely with the Italian main verb (here rifletterà). Certainly, the auxiliary could also be aligned with the Italian negation since it is used to build a negated English VP. I decided, though, to align the auxiliary with the Italian main verb because there are many other English constructions containing an auxiliary and a main verb which correspond to the Italian main verb (for example, I [do think]_VP ↔ io [penso]_VP, he [is playing]_VP ↔ egli [gioca]_VP). When such a context is given, the auxiliary and the main verb are both aligned with the corresponding Italian verb if the Italian VP does not have an auxiliary (cf. figure 25).

EN:  they_1/PRP  do_2/AUX  not_3/RB  ...  reflect_4/VB
IT:  esso_1/PRO:pers  non_2/NEG  rifletterà_17/VER:fin

Figure 33: Alignment comparison: they do not (properly) reflect and esso non rifletterà

In the VP pair in figure 34, an Italian VP is shown which consists of the reflexive verb permettersi (= allow, permit). The Italian reflexive pronoun occupies the position in front of the Italian finite verb permettesse. This can be compared with the position of the Italian object clitics shown in figure 28. The base alignment contains a link between the English subject pronoun and the Italian reflexive pronoun. The rule-based VP alignment, however, deletes this link and creates the alignment between the English pronoun and the Italian finite verb. Since the Italian reflexive pronouns have the same PoS as Italian object clitics, I excluded the alignment of English subject pronouns with Italian words tagged with the PoS CLI. This allows for the deletion of many links created between English subject pronouns and Italian object clitics, but it also prohibits the alignment between English subject pronouns and Italian reflexive pronouns, which could be considered correct.30 Furthermore, the rule-based VP alignment creates a link between allowed and permettesse, which was missing from the base word alignment.

30 Since I do not allow the alignment of English subject pronouns with Italian reflexive pronouns, these alignments are not part of the gold alignment.


EN:  I_15/PRP  might_16/VBP  be_17/AUX  allowed_18/VBN  to_19/TO  give_20/VB
IT:  mi_15/CLI  permettesse_16/VER:fin  di_17/PRE  rilasciare_18/VER:infi

Figure 34: Alignment comparison: I might be allowed to give and mi permettesse di rilasciare

The preceding examples show cases in which the alignment rules improve the alignment of English and Italian VPs and, especially, of the English subject pronoun. But the rule-based VP alignment still makes errors in the alignment of about 20% of the tested sentences. In the following, we examine the errors that are made by the described method for the VP alignment.

5.5.2 Error analysis

Manual examination of the erroneous alignments revealed problems which can be dividedinto four categories:

1. Correct Italian VP not found
   The parallel Italian VP is not found.

2. Extended VPs
   The VPs can contain coordinated verbs or infinitives which do not have a correspondence in the other language.

3. Alignment rules
   The rules compute false alignments when the VPs are too complex.

4. Erroneous preprocessing
   VP elements can have a false PoS.

In the following, the error categories are discussed. Sentence pairs and alignment examples are shown to demonstrate the problems within the task of the VP alignment.


Correct Italian VP not found

Alignment rules define alignments between English and Italian VPs. The correctness of the computed alignment for a given VP pair depends not only on the definition of the rules, but also on the assumption that the VPs correspond to each other. The method for finding the corresponding Italian VP for a given English VP has been described in section 5.3.2. The method relies on the base alignment: the Italian VP which has the most links to the English VP in the base alignment is considered to be the corresponding Italian VP. This is not always correct, so an incorrect Italian VP can be chosen.

If the English VP does not have any alignments to an Italian VP, an empty Italian VP is chosen. In this case, the English VP stays unaligned. A sentence pair for this case is shown in (95).

(95) a. As you know, like Mr. Rack, I come from a transit country ...

     b. Anch'  io,  come  l'   Onorevole   Rack,  provengo  da    un  paese    di  transito
        Also   I,   as    the  honourable  Rack,  come      from  a   country  of  transit

        'Like Mr. Rack, I also come from a transit country ...'

Whereas the English VP [you know]_VP does not have a corresponding Italian VP, the Italian VP [provengo]_VP should be identified as the phrase corresponding to the VP [I come]_VP. Unfortunately, the base alignment does not reveal this fact, so that the English VP stays unaligned, which lowers recall.

Until now, I postulated that for every English VP with a pronominal subject that is found, there is a parallel Italian VP. This phrase parallelism is not always present in the sentence pair which is to be processed. The English VP can correspond to an Italian phrase of some other category, for example to a PP as shown in (96).

(96) a. We understand that ...

     b. A   nostro  avviso
        At  our     notice

        'In our opinion'

Having identified the English pronominal subject we and its VP [understand]_VP, the search for the Italian VP is carried out. The VP search allows only VPs as phrases corresponding to the given English phrase, so that the PP [A nostro avviso]_PP cannot be determined as the parallel phrase of the English VP, even though the base alignment indicates this correspondence. In most cases of this kind of divergence, the English VP stays unaligned. Since the phrases are parallel, they are aligned to each other in the gold alignment, so this leads to a loss of recall.

Another phrase divergence which has been observed is shown in (97).

(97) a. Your group was alone in advocating what you are saying now.

     b. Soltanto  un   gruppo  politico   condivideva  l'   opinione  da lei espressa   in  questa  sede.
        Only      one  group   political  shared       the  opinion   of you expressed  in  this    seat.

        'Only one political group shared the opinion that you expressed in this seat.'

The English finite VP [you are saying]_VP corresponds to the Italian PP [da lei espressa]_PP consisting of the preposition da, the subject pronoun lei, and the participle espressa. In this form, it is a counterpart to the finite English VP, but in the passive voice. In the process of identifying Italian VPs in a given Italian sentence (cf. section 5.3.1), this kind of phrase is not identified as a VP, because it starts with a preposition and does not contain a finite verb form. The same problem occurs if the English finite VP corresponds only to a participle in Italian. So, in these cases, we have English VPs which stay unaligned, leading to a reduction of recall.

The problems with regard to the parallel Italian VPs can be summarized as follows:

1. Base alignment

   • False VP, because the base alignment is incorrect

   • No VP, because the base alignment does not contain links to any possible Italian VP

2. Phrase divergence (free translation, idioms)

   • EN: VP ↔ IT: PP

   • EN: finite VP ↔ IT: participle

In section 5.6, I present experiments that I carried out in order to account for theseproblems.
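The selection heuristic discussed in this subsection (choose the Italian VP with the most base alignment links to the English VP, or no VP at all if there are no such links) can be sketched as follows. The function name and the data layout are assumptions made for illustration; in particular, the tie-breaking behaviour (the first VP reaching the highest count wins) is a choice of this sketch and is not specified in the text.

```python
from collections import Counter


def select_italian_vp(en_vp_positions, it_vps, base_links):
    """Return the index of the Italian VP with the most base links, or None.

    en_vp_positions: set of English word positions belonging to the VP
    it_vps:          list of sets of Italian word positions, one set per VP
    base_links:      iterable of (english_position, italian_position) pairs
    """
    counts = Counter()
    for en_pos, it_pos in base_links:
        if en_pos not in en_vp_positions:
            continue
        for index, vp in enumerate(it_vps):
            if it_pos in vp:
                counts[index] += 1
    if not counts:
        return None          # empty Italian VP: the English VP stays unaligned
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: the English VP at positions 3-4, two Italian candidate VPs.
    links = [(3, 7), (4, 7), (3, 1), (9, 2)]
    print(select_italian_vp({3, 4}, [{1, 2}, {7}], links))
```

A wrong base link is enough to change the counts, which is the first error source listed above.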

Extended VPs

In the previous discussion, examples of VPs have been shown which consist of only one main verb or subcategorized infinitive. VPs can also contain a sequence of verbs which are either combined by a coordination or form an enumeration separated by commas.

(98) a. It is irresponsible of EU Member States to refuse to renew the embargo.

     b. Gli  stati   membri  dell'   unione  sono stati  irresponsabili  a non rinnovare  l'   embargo.
        The  states  member  of the  union   were        irresponsible   to not renew     the  embargo.

        'It is irresponsible of Member States not to renew the embargo.'

The sentence pair in (98) shows the English VP [It is ... to refuse to renew]_VP and its Italian counterpart [sono stati ... a non rinnovare]_VP. The English VP contains two to-infinitives. The first one, namely [to refuse]_XCOMP, does not have a direct VP correspondence in Italian. Semantically, it is instead equivalent to the Italian negation non. This type of correspondence is not described by the alignment rules. In this context, [to refuse]_XCOMP as well as [to renew]_XCOMP are aligned with [a rinnovare]_XCOMP, whereas the Italian negation remains unaligned.

This sentence pair reveals another divergence in the way the same fact is expressed in the given language pair. Whereas in English the expletive has the role of the sentence subject, in Italian the subject is the NP [Gli stati membri dell' unione]_NP, which is a translation of the English PP [of EU Member States]_PP. This divergence also exists in the processed VP pairs. In some cases in which the English VP has a pronominal subject, the corresponding Italian VP has a nominal phrase as its subject. Since the Italian subject was not taken into account unless it is a pronoun, the English pronoun is not aligned with the Italian subject NP, but instead with the corresponding element of the VP.

Example (99) shows the Italian VP [è chiedere] containing the infinitive chiedere, which does not correspond to any part of the English VP [It is not]. It is, therefore, seen as an extension of the Italian VP, which leads to false alignments.

(99) a. It is not a lot to ask.

     b. Non  è chiedere  molto.
        Not  is to ask   lot

        'It is not a lot to ask.'

According to the parse tree for the English sentence in (99), the VP with the pronominal subject does not contain the to-infinitive [to ask]_XCOMP. It is instead embedded in an adverbial phrase together with the adverbial lot. On the other side, the search for VPs in the Italian sentence returns only one VP, namely [è chiedere]_VP. Given the VPs [It is not]_VP and [è chiedere]_VP, the Italian infinitive is aligned with the English finite verb is, and not with [to ask]_XCOMP. This additional false link leads to a reduction of precision.

The described problems can be summarized as follows:

1. Subcategorization of in�nitivesPoses a problem if the in�nitive does not have equivalent phrase in the otherlanguage

2. CoordinationA coordination of verbs, in which not every verb has a counterpart in the otherlanguage

Alignment rules

The definitions of the contexts in which a specific alignment rule should be applied are not error-free. Additionally, since n-m alignments have to be allowed, a VP element that is already aligned is not prohibited from being aligned with further words. In complex VPs which contain additional elements such as subcategorized infinitives or a sequence of verbs, the rules generate too many links. For example, if both VPs contain two coordinated finite verbs, six links are generated: one from each English finite verb to each Italian finite verb, and one from the English subject to each of the two Italian finite verbs. This is shown in the sentence pair in (100).

(100) a. With regard to the budget and annual appropriations, we agree with the rapporteur's position and fully support it.

b. Per quanto attiene  al   bilancio  e    alle  dotazioni       annuali,
   Regarding           the  budget    and  the   appropriations  annual,

   condividiamo  e    appoggiamo  la   posizione  della  relatrice.
   share         and  support     the  position   of     the rapporteur.

'Regarding the budget and the annual appropriations, we share and support the position of the rapporteur.'

The computed alignment for the underlined VPs in (100) is shown in figure 35. The rule (82a) leads to the alignments between the English subject pronoun and both Italian finite verbs. The rule (85a) is responsible for the alignments between both English finite verbs and both Italian verbs.

[Alignment diagram. English: we9/PRP, agree10/VBP, support18/VBP. Italian: condividiamo10/VER:fin, appoggiamo12/VER:fin. Each of the three English words is linked to both Italian verbs, giving six links in total.]

Figure 35: Alignment of we agree (...) support and condividiamo (...) appoggiamo
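The over-generation in figure 35 can be reproduced with a small sketch. It is not the implementation of rules (82a) and (85a); it only assumes, for illustration, that every English subject pronoun and finite verb is linked to every Italian finite verb, with no limit on the number of links per word. The function name, the token lists and the gold links are hypothetical and follow the description of example (100).

```python
from itertools import product

# Hypothetical (word, PoS) lists for the underlined VPs in example (100).
english_vp = [("we", "PRP"), ("agree", "VBP"), ("support", "VBP")]
italian_vp = [("condividiamo", "VER:fin"), ("appoggiamo", "VER:fin")]


def unrestricted_links(en_vp, it_vp):
    """Link every English pronoun/finite verb to every Italian finite verb."""
    english_tags = {"PRP", "VBP"}
    return {
        (en_word, it_word)
        for (en_word, en_tag), (it_word, it_tag) in product(en_vp, it_vp)
        if en_tag in english_tags and it_tag == "VER:fin"
    }


links = unrestricted_links(english_vp, italian_vp)

# Links regarded as correct in the text: the subject with both Italian verbs,
# and each English verb only with its own Italian counterpart.
gold = {
    ("we", "condividiamo"),
    ("we", "appoggiamo"),
    ("agree", "condividiamo"),
    ("support", "appoggiamo"),
}

false_links = links - gold
precision = len(links & gold) / len(links)
recall = len(links & gold) / len(gold)
print(len(links), sorted(false_links), round(precision, 2), recall)
```

Under these assumptions, four of the six generated links are correct: recall stays perfect, while the two crossing verb links lower precision.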

While the alignments between the English subject and both Italian verbs can be considered correct, each English verb should be aligned only with its corresponding Italian verb. The alignment shown thus contains two false links, which reduce precision.

English VPs with modal verbs pose the main problem for the rules and the context definitions for the alignment of VP elements. Figure 36 shows the computed alignment of an English VP containing auxiliaries and modals, and its Italian counterpart.

[Alignment diagram. English: it10/PRP, may11/MD, have12/AUX, been13/AUX. Italian: sia4/AUX:fin, stata5/VER:ppast. The English verbs may, have and been are each linked to more than one Italian word.]

Figure 36: Alignment of it may have been and sia stata


In this example, too many links are computed. The English finite verb should only be aligned with the Italian finite verb, so the link between may and stata is false. The English auxiliary have should also only be aligned with the Italian finite verb sia. The participle been, as the main verb, should only be aligned with the Italian participle and main verb stata.

False links are generated because of the complex English context. Many rules check whether the English VP contains auxiliaries or modals. The links for the English verbs are computed according to the result of such a context check. In the given example, this generates both correct and incorrect links.

Head switching

Head switching is a phenomenon which involves syntactic and semantic differences between languages. The main semantic contributor of a phrase in one language does not correspond to the head of the corresponding phrase in the other language [Butt, 94]. For example, the main verb of a VP, which bears the semantic information of the VP, need not always correspond to the main verb of the parallel VP in another language. This kind of divergence is given in (101). The semantics of the English verb answer corresponds to the semantics of the Italian noun risposta.

(101) a. ... they had been answered in a previous part-session.

b. ...  avessero  già      ottenuto  risposta  in  una  tornata  precedente.
   ...  have      already  received  answer    in  one  session  previous

'... they have already received the answer in the previous session.'

With respect to the word alignment, we can say that one verb in one language is equivalent to a combination of a verb and an NP in the other language. In example (101), the English verb answered corresponds to the Italian verb ottenuto and the object NP risposta. For the given VPs, the alignment rules produce the alignments shown in figure 37.

[Alignment diagram. English: they14/PRP, had15/AUX, been16/AUX, answered17/VBN. Italian: avessero18/VER:fin, ottenuto20/VER:ppast. The Italian object risposta is not part of the diagram.]

Figure 37: Alignment of they had been answered and avessero ottenuto (risposta)


The Italian object is not a part of the Italian VP with which the English VP should be aligned. So, the English verb answered is not aligned with the Italian object risposta. The produced alignment is incomplete since the link between answered and risposta is missing.

It is likely that the statistically computed base alignment contains the alignments between a verb on the one side, and a verb and object NP on the other side (cf. example (101)). On the other hand, even if the object NP were a part of the Italian VP, the alignment rules would not allow for alignments between verbs and nouns or articles, since they only allow for alignments between words whose PoS indicates that the alignment candidates are a part of a VP, i.e. verbs, negation and subject pronouns. So, for the given case, the alignment rules produce only some of the correct word alignments.

The sentence pair in (102) shows another construction difference that is comparable with the divergence in example (101).

(102) a. ... you were unable to attend the Conference of Presidents last Thursday.

b. ...  lei  non  ha    potuto  partecipare  giovedì   scorso  alla    conferenza  dei     presidenti.
   ...  you  not  have  could   participate  Thursday  last    to the  conference  of the  presidents.

'... you could not participate in the Conference of Presidents last Thursday.'

The English VP [you were]VP is a part of a predicative phrase consisting of the mentioned VP and the adjective unable, where unable subcategorizes the following to-infinitive [to attend]XCOMP. In the parse tree, the XCOMP is embedded in an adjective phrase ADJP, so that it is not identified as a part of the VP [you were]VP. This causes two difficulties: (i) unable, which has the PoS JJ, cannot be aligned with the equivalent Italian phrase [non ha potuto]VP (negation and verbs), and (ii) the Italian infinitive partecipare is not aligned with its English equivalent [to attend]XCOMP. The computed alignment for the example in (102) is given in figure 38.

Figure 39 shows a combination of the computed VP alignments (solid lines) and the base alignment (dashed lines) for the VPs in (102). This combination of the VP alignment and the base alignment would be desirable as the output alignment, but the dashed alignments are not a part of the resulting word alignment for the given sentence pair. When the words belonging to the VPs which should be aligned to each other are identified, all base alignments for these words are first deleted. Subsequently, the phrase elements are aligned according to the alignment rules, so that the words of the given VP pair can only be aligned to each other, and not to words outside of them. In the given example, this leads to the deletion of correct links.


[Alignment diagram. English: you4/PRP, were5/VBD. Italian: lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast, partecipare9/VER:infi. The pronoun you is linked to lei; the remaining links originate from were.]

Figure 38: Alignment of you were (unable to attend) and lei non ha potuto partecipare

[Alignment diagram. English: you4/PRP, were5/VBD, unable6/JJ, to7/TO, attend8/VB. Italian: lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast, partecipare9/VER:infi. The links for you and were are the same as in figure 38; base alignment links are added for unable, to and attend, including a dashed link between attend and partecipare.]

Figure 39: Alignment of you were unable to attend and lei non ha potuto partecipare

This type of link deletion problem could be solved by checking how reliable the alignments of a given word with the words outside of the corresponding VP are. Because of the assumption that the elements of VPs should only be aligned to each other, these cases of divergence have not been investigated further.
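The delete-then-realign step that causes this problem can be sketched as follows. This is a minimal illustration and not the thesis implementation: links are pairs of (English position, Italian position), the positions follow figures 38 and 39, and the concrete base links and rule links are hypothetical.

```python
def realign_vp(base_links, en_vp, it_vp, rule_links):
    """Drop every base link that touches a word of the VP pair,
    then add the links produced by the alignment rules."""
    kept = {
        (en_pos, it_pos)
        for (en_pos, it_pos) in base_links
        if en_pos not in en_vp and it_pos not in it_vp
    }
    return kept | set(rule_links)


# Word positions from example (102).
en_vp = {4, 5}            # you, were
it_vp = {5, 6, 7, 8, 9}   # lei, non, ha, potuto, partecipare

# Hypothetical base alignment: (8, 9) is the correct link attend-partecipare,
# (12, 13) stands for a link between words outside both VPs.
base_links = {(4, 5), (8, 9), (12, 13)}

# Hypothetical output of the alignment rules for the VP pair.
rule_links = {(4, 5), (5, 7), (5, 8)}

result = realign_vp(base_links, en_vp, it_vp, rule_links)
print(sorted(result))
```

In this sketch, the link (8, 9) is lost because partecipare belongs to the Italian VP, although attend lies outside the English VP. A reliability check as suggested above would have to be applied before the deletion.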


5.6 System extensions

In the previous section, I presented the errors made by the rule-based method for computing the word alignment between an English VP with a pronominal subject and its Italian counterpart. Some assumptions were made that, unfortunately, did not always lead to correct alignments. We saw that the process of searching for an Italian VP on the basis of the base alignments can be erroneous (cf. assumption (A2) in section 1.2). Furthermore, the assumption that a given English VP can only be expressed with a VP in Italian does not hold in all cases (cf. assumptions (A1) and (A3) in section 1.2). In the following, I suggest some improvements to the presented work in order to address the problems that have been observed.

5.6.1 Lexical search for the matching Italian VP

In section 5.5.2, parallel sentences were shown in which the wrong Italian VP was identified as the counterpart of the given English VP. Let us consider once more the example shown in (103).

(103) a. As you know, like Mr. Rack, I come from a transit country ...

b. Anch'  io,  come  l'   Onorevole   Rack,  provengo  da    una  paese    di  transito
   Also   I,   as    the  honourable  Rack,  come      from  a    country  of  transit

'Like Mr. Rack, I also come from a transit country ...'

The Italian VP [provengo]VP has not been identified as the parallel VP to the English VP [I come]VP. In this example, the similarity in meaning of come and provengo could provide us with the information that these two VPs correspond to each other. So, lexical translation probabilities may be helpful for identifying the matching Italian VP.

There are two ways to include lexical translation probabilities in the subroutine for finding the matching Italian VP. The search could be changed so that only a lexical search for the Italian VP is carried out. Alternatively, we can combine the lexical search and the search based on the base alignment. I carried out two experiments including lexical translation probabilities for the identification of the parallel Italian VP. The first one includes only lexical probabilities, whereas in the second, the base alignment and lexical translation probabilities are combined.

The lexical search uses lexical translation probabilities computed by Moses based on the base word alignment. For a given English VP e = e_1, ..., e_n, the matching probability of an Italian VP i = i_1, ..., i_m is computed using equation (17).

\[
p(i \mid e) = \sqrt[m]{\prod_{k=1}^{m} \max_{e_l \in e} p(i_k \mid e_l)} \tag{17}
\]

For each Italian word i_k which is a part of an Italian candidate VP i, the highest probability of generating it from one of the elements e_l of the English VP e is taken and multiplied with the highest probabilities of the other Italian words within i. The m-th root of the product (i.e. the geometric mean) is computed in order to normalize for the length of the Italian VP, so that candidate VPs of different lengths remain comparable.

The most probable matching Italian VP i_max for a given English VP e is the Italian VP with the highest matching probability. The probability of the most probable Italian VP has to be higher than a threshold t, since the probabilities can be relatively small, indicating that the phrases are not very likely to be parallel. This is shown in equation (18). I manually set the threshold to t = 0.001. On the test set, this threshold led to the best evaluation results. If the probability of i_max lies below the threshold, an empty Italian VP is returned.

\[
i_{max} =
\begin{cases}
\arg\max_{i} p(i \mid e), & \text{if } \max_{i} p(i \mid e) > 0.001 \\
[\,], & \text{else}
\end{cases} \tag{18}
\]
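Equations (17) and (18) can be illustrated with a short sketch. The lexical probabilities below are invented for the example and do not come from the Moses lexical table; the function names are hypothetical as well. The sketch assumes a non-empty English VP and treats unseen word pairs as having probability 0.

```python
def match_probability(en_vp, it_vp, lex_prob):
    """Equation (17): geometric mean, over the Italian words, of the best
    lexical probability p(i_k | e_l) among the English words."""
    if not it_vp:
        return 0.0
    product = 1.0
    for it_word in it_vp:
        product *= max(lex_prob.get((it_word, en_word), 0.0) for en_word in en_vp)
    return product ** (1.0 / len(it_vp))


def best_italian_vp(en_vp, candidates, lex_prob, threshold=0.001):
    """Equation (18): the best candidate, or an empty VP below the threshold."""
    best = max(
        candidates,
        key=lambda vp: match_probability(en_vp, vp, lex_prob),
        default=[],
    )
    if match_probability(en_vp, best, lex_prob) > threshold:
        return best
    return []


# Invented probabilities p(italian | english), for illustration only.
lex_prob = {
    ("provengo", "come"): 0.2,
    ("provengo", "I"): 0.05,
    ("assume", "know"): 0.0005,
}
candidates = [["provengo"], ["assume"]]

print(best_italian_vp(["I", "come"], candidates, lex_prob))
print(best_italian_vp(["you", "know"], candidates, lex_prob))
```

With these invented values, [provengo] is returned for [I come], whereas the best candidate for [you know] stays below the threshold and an empty VP is returned.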

The evaluation results for the different approaches to searching for the Italian VP are shown in table 7.

IT-VP search      # alignments   Precision   Recall   F-score
Lexical           556            0.68        0.67     0.67
Base              572            0.80        0.81     0.81
Base + lexical    604            0.79        0.84     0.81

Table 7: Evaluation of VP alignment for different IT-VP identification approaches

The search based only on lexical probabilities does not lead to desirable results. This is due to the fact that verbs can have many different translations, so that the most probable translation is not correct in every context. Furthermore, in equation (17), the word or phrase positions are not taken into account. For instance, it can happen that the position of the most probable Italian VP differs significantly from the position of the English phrase. This could indicate that the phrases do not match each other, but the proposed computation does not have access to this kind of knowledge. Finally, there are no checks as to whether a found Italian VP has already been identified as the parallel VP of some other English phrase.

Some of the mentioned problems can be partially solved if the base alignment and the lexical search are combined. This is done as follows: first, the base alignment search is carried out. If no Italian VP is found, the lexical search is applied. The combination of base alignment and lexical search leads to a higher recall, since some VPs are found which have not been identified by the base alignment search. As an example of this case, we consider the VP [I come]VP from the English sentence in (103). The correct alignment is shown in figure 40. The base alignment does not identify the Italian VP [provengo]VP as a counterpart of the English phrase [I come]VP. In fact, it fails to find any Italian VP for the given English VP, which results in unaligned words of the English VP. Since no Italian VP has been found, the lexical search is applied. This search process finds the correct Italian VP, and the alignment rules define correct alignments between the phrase elements.
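The combination described above amounts to a simple fallback, sketched here under the assumption that both searches return an empty list when no Italian VP is found. The two search functions are hypothetical stand-ins, not the subroutines of the implemented system.

```python
def find_italian_vp(en_vp, base_search, lexical_search):
    """Run the base-alignment search first; use the lexical search
    only if the base-alignment search finds no Italian VP."""
    it_vp = base_search(en_vp)
    if it_vp:
        return it_vp, "base"
    return lexical_search(en_vp), "lexical"


# Hypothetical stand-ins for the two search procedures.
def base_search(en_vp):
    # Pretend the base alignment finds no Italian VP for [I come].
    return []


def lexical_search(en_vp):
    # Pretend the lexical search finds [provengo] for [I come].
    return ["provengo"] if en_vp == ["I", "come"] else []


it_vp, source = find_italian_vp(["I", "come"], base_search, lexical_search)
print(it_vp, source)
```

Because the lexical search is only consulted when the base search returns nothing, it can add links for previously unaligned VPs but never replaces a VP found through the base alignment.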


[Alignment diagram. English: I8/PRP, come9/VBP. Italian: provengo10/VER:fin. Both English words are linked to provengo.]

Figure 40: Alignment of I come and provengo

Unfortunately, the combination of the two search methods leads to lower precision compared with the precision of the base alignment search. There are two reasons for this. First, there are contexts in which an English VP should stay unaligned since it has no counterpart in Italian. The lexical search nevertheless returns a parallel VP if its translation probability is higher than the threshold. Second, a false VP may be identified because it has a higher probability than the correct parallel phrase. To demonstrate this, the English VP [you know]VP from the example sentence (103) is taken. Its alignment is shown in figure 41.

[Alignment diagram. English: you1/PRP, know2/VBP. Italian: assume20/VER:fin. Both English words are linked to assume.]

Figure 41: Alignment of you know and assume

The base alignment does not identify any Italian VP as a counterpart of the English [you know]VP. So, the lexical search is applied, which suggests that the Italian VP [assume]VP is a parallel phrase to the given English VP. This is false.

5.6.2 Retaining the base alignment

As already discussed, the English VP does not need to have an Italian VP as its counterpart. It can correspond to a phrase of some other type, for example, to a prepositional phrase, or simply to a participle. The implemented method for VP alignment does not allow for this kind of parallelism. For an English VP, only an Italian VP can be found as its parallel phrase. If the lexical search does not find any parallel Italian VP, then instead of leaving the English VP unaligned, its base alignment could be retained. This could lead to correct alignments which cannot be created by the alignment rules, but it could also lead to alignments which are incorrect. The results of the experiment on retaining the base alignments are shown in table 8.

Alignment / Score    # alignments   Precision   Recall   F-score
Base                 522            0.66        0.61     0.64
Rule-based           567            0.80        0.81     0.81
Rule-based + base    588            0.79        0.82     0.80

Table 8: Evaluation of different VP alignments


The evaluation results show that retaining the base alignment for the phrases for which no alignment could be computed has a negative impact on precision.

5.7 Summary

In this chapter, I presented a method for the alignment of English and Italian VPs which have pronominal subjects. The aim of the rules developed for VP alignment was to correct the alignment of the English pronominal subject, which often does not have an Italian counterpart and is therefore often aligned with incorrect Italian words.

Since the alignment of English subject pronouns depends on the alignment of their VPs, the rules were written to cover the alignment of entire English and Italian VPs. The definition of the alignment rules was motivated by both the linguistic and semantic characteristics of the verbs. Words which bear similar features (for example, number, definiteness, person, etc.) are aligned to each other. The rules do not have any lexical knowledge. They operate on the PoS sequences of the parallel VPs. The evaluation revealed that the rule-based VP alignment reaches higher precision, recall and f-score than the base word alignment (cf. table 8). The f-score of the base alignment is 0.64, whereas the f-score of the rule-based VP alignment is 0.81.

Parallel VPs have been extracted on the basis of the base alignment of each English VP. Since English parse trees were available, in most cases the correct English VPs have been extracted. The Italian VPs were identified on the basis of PoS sequences that form a VP (cf. section 5.3.1). The identification of correct Italian VPs is not ideal; more error-free VPs (consisting only of verbal elements that belong to the specific VP) could have been extracted if Italian parse trees had been available.

The identification of parallel VPs, which is based on the base word alignment, is not always correct. When, in addition to the base alignment, the lexical translation probabilities are included in the search for parallel VPs, recall improves slightly, but precision falls (cf. table 7 in section 5.6.1). This is due to the fact that the search method nearly always finds a matching Italian VP for the English input. In some cases, the found Italian VP is correct, but in others it is not.

The examination of the parallel corpus showed that there are many syntactic divergences between English and Italian (cf. section 5.5.2). Frequently, the English VP does not have an Italian counterpart because the whole clause has not been translated (free translation). Furthermore, English VPs can also correspond to Italian PPs, participles or to the arguments of Italian verbs. Such cases of phrase divergences have not been dealt with in this work.

The alignment rules lead to satisfying alignments of the PoS sequences in the majority of VPs, but they produce false links if they are applied to complex VPs (coordinated verbs or to-infinitives). They search through all VP elements and compute all possible links, sometimes associating one English verb with two Italian verbs, and vice versa (cf. figure 35, section 5.5.2). This is due to the implementation of the rules: there is no limitation on the number of links that can be computed for an input word. This could be improved by using the lexical translation probabilities. If there are a number of candidates that an English main verb could be aligned with, the lexical translation probabilities and the word positions in the VP could be used to determine which alignment is the most probable, while the other links would be discarded.

In this work, only those VPs have been dealt with that have pronominal subjects, but the method presented in this section could also be applied to VPs with NP subjects.

In this section, we have defined the VP alignment rules and conducted an evaluation and an error analysis of the generated alignments. In the following section, the SMT systems built using the two different word alignments (base alignment and base alignment combined with rule-based VP alignment) will be presented and evaluated. A detailed examination of translation parameters will be carried out in order to explain the evaluation results.


6 Evaluation of SMT systems

In this section, I present an evaluation of four SMT systems. For each translation direction, I built two SMT systems: a baseline system (M1) and a system using the rule-based VP alignment (Mmod). I introduce the evaluation measure BLEU and present the BLEU scores of the M1 and Mmod systems. I discuss why the improved VP alignment does not lead to an improvement in the translation of null subjects. Subsequently, I discuss possible solutions to the problem.

6.1 The BLEU score

In the previous chapter, it has been demonstrated that the word alignment is improved by applying the alignment rules to the base alignment. We will now evaluate whether the word alignment improvement has an impact on the quality of the generated translations.

In this work, the quality of translation is measured using BLEU [Papineni et al., 02]. The computation of the BLEU score takes into account the similarity between the generated translation (hypothesis) and one or more reference translations, which are correct translations of the sentence to be translated. The similarity is expressed by a modified n-gram precision p_n. The sentences are viewed as sets of n-grams, i.e. word sequences of length n. The count of an n-gram is clipped to the maximum number of occurrences of the n-gram in any one of the references. The modified precision for n-grams of length n is computed by summing the clipped matches over every hypothesis C in the whole corpus Candidates and dividing by the total number of n-grams in the hypotheses. This is expressed in equation (19).

\[
p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}
           {\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')} \tag{19}
\]

In addition to the modified n-gram precision, the BLEU score also considers the length c of the hypothesis: it should not be too short compared with the reference, which has a length r. This is expressed by the brevity penalty BP in (20). Sentences which are too long are already penalized by lower precision.

\[
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{1 - r/c} & \text{if } c \le r
\end{cases} \tag{20}
\]

The BLEU score is computed by combining the n-gram precisions and the brevity penalty as demonstrated in equation (21). N represents the maximum length of n-grams (usually N = 4), whereas the weights w_n are uniform: w_n = 1/N.

\[
BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \tag{21}
\]
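Equations (19) to (21) can be put together in a compact sketch. It is not the scoring script used for the evaluation; it assumes tokenized input, uniform weights, and takes the length of the first reference as r, which is adequate only for the single-reference setting used in this work. All function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypotheses, references, n):
    """Equation (19): clipped n-gram matches over all hypothesis n-grams."""
    clipped, total = 0, 0
    for hyp, refs in zip(hypotheses, references):
        hyp_counts = ngrams(hyp, n)
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped += sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(c, r):
    """Equation (20)."""
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(hypotheses, references, max_n=4):
    """Equation (21) with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(hypotheses, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the score is 0 in this case
    c = sum(len(hyp) for hyp in hypotheses)
    r = sum(len(refs[0]) for refs in references)  # assumes one reference
    log_average = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_average)


reference = "you were not at home".split()
print(bleu([reference], [[reference]]))
print(modified_precision([["the", "the", "the"]], [[["the", "cat"]]], 1))
```

The second call shows the effect of clipping: the hypothesis contains the three times, but only one occurrence is credited because the reference contains it once.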


6.2 Evaluation of SMT systems

I have built four SMT systems, two for each translation direction. The baseline SMT systems (M1) use the base word alignment produced by GIZA++, whereas the other two systems (Mmod) use the modified word alignment. All systems are built on a parallel corpus containing 749,646 sentence pairs. The same corpus was used to build the language models. As a dev and a test set, I used the WMT Newstest 2009.³¹

Each set (development and test) contains 1000 sentences. For the computation of BLEU, one reference sentence was used. The evaluation results (BLEU scores) are shown in table 9.

                      IT → EN   EN → IT
Baseline SMT (M1)     22.07     19.15
Improved WA (Mmod)    21.81     18.18

Table 9: BLEU scores of the SMT systems for EN ↔ IT

6.3 Error analysis

The BLEU scores are slightly worse for Mmod. However, a closer look at the translations generated by the base and modified systems shows that the translations are nearly the same. Often, the sentences differ only in synonyms. Such differences can unfortunately have a strong impact on BLEU scores. In the following, a detailed analysis of the translation of subject pronouns is presented and discussed.

Translation direction IT → EN

Manual examination of subject pronoun translations revealed that both systems perform equally well. Looking at the translations, I noticed that some null subject pronouns are translated better than others. The 1st and 2nd person pronouns seem to be easier to translate than 3rd person pronouns. This could be explained by the fact that the use of pronouns for the 1st and 2nd person is more common. For example, if someone speaks for himself, he would rather refer to himself by a pronoun than by an NP. For a parallel corpus, this means that the English subject pronoun for the 1st person singular is very likely to occur together with an inflected Italian verb with an omitted subject. This leads to a higher probability of translating the Italian VP with an inflected finite verb into the corresponding English pronoun and VP (and vice versa). In table 10, two possible translations for the Italian verb form so (= I know) are shown.³²

so →     i know    know      phrase count   phrase pair types
M1       0.5546    0.1611    1,850          190
Mmod     0.6202    0.0113    1,851          228

Table 10: Translation probabilities for so into (I) know

³¹ http://www.statmt.org/wmt09/
³² The column phrase count shows how often the SL phrase has been extracted. The column phrase pair types denotes the number of different translations of the SL phrase.

Table 10 also shows the difference in probabilities between the two systems, indicating that the rule-based word alignment of VPs does have an impact on translation probabilities. The English pronoun I occurs in 56% of possible translations for the Italian verb so in M1. In Mmod, the English pronoun is found in 68% of phrases. The five most probable phrase translations for so are shown in table 11.

so →     M1                    Mmod
         0.5546  i know        0.6202  i know
         0.1611  know          0.0918  i am
         0.0497  i am          0.0448  i am aware
         0.0373  i am aware    0.0189  i understand
         0.0178  i             0.0113  know

Table 11: Top five translation phrases for so

The probability of generating the English verb know without a subject, given the Italian phrase so, is higher in M1 than in Mmod. This is the result of the rule-based VP alignment. The rules align the English subject pronoun with the same Italian verb as the corresponding English finite verb. Thus, they enforce that the Italian verb so is aligned only with the English word sequence I know. This alignment leads to a higher probability of extracting the phrase pair (so, I know) compared to the phrase pair (so, know). In M1, the phrase pair (so, know) was extracted 298 times, whereas in Mmod the translation pair was extracted only 21 times. This means that in Mmod, there were only 21 cases in which the inflected Italian verb so was not aligned with the English pronoun when it occurred with the English verb know. In these cases, it is likely that the English clauses had NP subjects (due to free translation), so that the VP alignment rules were not applied.

The 2nd person singular pronouns are somewhat more complicated. My intuition is that their usage is comparable with the usage of the 1st person pronouns. However, looking at the generated translations, I observed that incorrect English subject pronouns are often generated. This is due to the ambiguity of the Italian verbs. Example sentence (31) (cf. chapter 3.2.1) already showed such a case of ambiguous Italian verbs. The same example is shown again in (104a). The translation produced by M1 and Mmod is shown in (104b).

(104) a. Hai   detto  che   parli  italiano.
         have  said   that  speak  Italian.

'You said that you speak Italian.'

b. You have said that speaks Italian.


Both translation systems generate the same translation for the sentence in (104a). The input was segmented into the following phrases.

(105) [Hai]p1 [detto]p2 [che parli]p3 [italiano.]p4

The phrases p1 and p2 generate the correct English pronoun and verb, but the phrase p3 leads to a false translation, which does not have the obligatory subject pronoun. Furthermore, the verb parli could indicate the 2nd person singular indicative or the 3rd person singular subjunctive. The phrases that speaks and che parli are parallel if that and che are relative pronouns. They are very likely to be translated into each other. In this example, this interpretation of that and che is wrong. Since the SMT systems do not have this knowledge, they produce incorrect translations in this case.

che parli →   that you speak   that she speaks   that speaks   phrase count   phrase pair types
M1            -                0.125             0.125         8              8
Mmod          -                0.125             0.125         8              8

Table 12: Translation probabilities of che parli into that (you/she) speak(s)

Translation table 12 for the phrase che parli shows that the correct translation of the phrase is not included in the table at all. There is no difference in the probability distribution for the phrase che parli between M1 and Mmod, since che is used as a relative pronoun in nearly all sentences. Hence, the parallel English sentence does not have a personal subject pronoun, which is necessary for applying the VP alignment rules. Looking at the translations of parli shown in table 13, the correct translation phrase is present, but it has a very small probability.

parli →   you speak   speak    you talk   talk      phrase count   phrase pair types
M1        0.009       0.099    -          0.054     111            71
Mmod      0.0083      0.075    0.0083     0.0333    120            85

Table 13: Translation probabilities of parli into (you) speak/talk

With respect to the 2nd person pronouns, I also noticed that the Italian verbs for the second person singular indicative are very rare in the corpus that I was working with (cf. chapter 2.4), which has a negative impact on translating them into English.

The 3rd person pronouns are the most difficult to translate. They do not occur as frequently as the other pronouns with the corresponding verb form. The intuition is that the verbs marking the 3rd person occur very often with a subject NP. In the word alignment, the verbs of the given language pair are aligned to each other. In the phrase extraction step, they are then extracted as parallel phrases, and the English VP does not contain a subject pronoun. As a translation example, we consider the sentence in (106a). M1 and Mmod generated the translation shown in (106b).


(106) a. Hanno  cantato  la mia  canzone.
         have   sang     my      song.

'They sang my song.'

b. Have been sung my song.

The input sentence has been segmented as follows:

(107) [Hanno]p1 [cantato]p2 [la mia]p3 [canzone.]p4

In this example, the translation of the phrases p1 and p2 is crucial for obtaining the correct output. First, we examine the translation probabilities of hanno as an inflected Italian verb (cf. table 14).

hanno →   have      they have   phrase count   phrase pair type
M1        0.5761    0.0264      15,589         1961
Mmod      0.5552    0.0388      15,292         2230

Table 14: Translation probabilities of hanno into (they) have

The difference in translation probabilities between the phrase pairs (hanno, have) and (hanno, they have) is huge. Even if the language model gives lower scores to generated sentences which do not have a subject (pronoun) at the beginning of the sentence, such sentences may still be generated. The rule-based VP alignment does, however, lead to a higher number of occurrences of the phrase pair (hanno, they have). Whereas in M1 it was extracted 412 times, in Mmod the phrase pair was extracted 593 times. The English pronoun they occurs in 6% of possible translations in M1, whereas in Mmod it occurs in 12% of translation phrases.

Similar behaviour is also observed in the case of inflected main verbs (cf. table 15): they is a part of 15% of the translation phrases in M1, whereas in Mmod it is included in 27% of the phrases. Compared to the phrase pairs in table 14, the rule-based VP alignment leads to small changes regarding the counts for the translation phrases of pensano.

pensano →   think     they think   believe   they believe   phrase count   phrase pair type
M1          0.2251    0.0471       0.0733    0.0157         191            92
Mmod        0.2126    0.0435       0.0628    0.0241         207            104

Table 15: Translation probabilities of pensano into (they) think
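The extraction counts quoted in this section can be related to the probabilities in tables 10 and 14 with a small check. It assumes that a phrase translation probability is the relative frequency of the phrase pair among all extractions of the source phrase, i.e. the pair count divided by the value in the column phrase count. This is the usual estimate in phrase-based SMT, but the assumption is made here only for illustration; the function name is hypothetical.

```python
def phrase_translation_probability(pair_count, source_count):
    """Relative frequency of a phrase pair among all extractions of the source phrase."""
    return pair_count / source_count


# (source phrase, target phrase, system) -> (pair count, phrase count of the source)
counts = {
    ("so", "know", "M1"): (298, 1850),
    ("so", "know", "Mmod"): (21, 1851),
    ("hanno", "they have", "M1"): (412, 15589),
    ("hanno", "they have", "Mmod"): (593, 15292),
}

estimates = {
    key: round(phrase_translation_probability(pair, total), 4)
    for key, (pair, total) in counts.items()
}

for key, value in estimates.items():
    print(key, value)
```

Under this assumption, the four estimates are 0.1611, 0.0113, 0.0264 and 0.0388, which agree with the corresponding entries in tables 10 and 14.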

Further examination revealed another problem, which concerns the morphology of Italian and the corpus characteristics. The sentences in (108a) and (109a) differ only in the gender of the subject, which is marked by the Italian participles stata (fem.) and stato (masc.). The sentences in (108b) and (109b) are the generated translations of the sentences (108a) and (109a).


(108) a. Lei      non  è     stata  a   casa.
         you/she  not  have  been   at  home.

'You/she were/was not at home.'

b. You was not at home.

(109) a. Leiyou

nonnot

èhave

statobeen

aat

casa.home.

'You were not at home.'

b. You were not at home.

Whereas the word sequence Lei non è stato was extracted as a translation unit, this was not the case for Lei non è stata. In the translation process, this led to the following segmentation of the sentences.

(110) [Lei]p1 [non è stata]p2 [a casa.]p3

(111) [Lei non è stato]p1 [a casa.]p2
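The two segmentations can be reproduced with a simple longest-match lookup. This is only a sketch: a phrase-based decoder scores many competing segmentations, whereas the hypothetical function below greedily takes the longest known phrase from left to right. The phrase inventory is limited to the units mentioned in examples (110) and (111).

```python
def greedy_segment(tokens, known_phrases, max_len=7):
    """Split tokens into the longest known phrases, left to right."""
    segments, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; fall back to a single token.
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in known_phrases or j == i + 1:
                segments.append(candidate)
                i = j
                break
    return segments

# Source phrases assumed to be in the phrase table (from examples 110 and 111).
known = {"lei", "non è stata", "a casa .", "lei non è stato"}
print(greedy_segment("lei non è stata a casa .".split(), known))
# ['lei', 'non è stata', 'a casa .']
print(greedy_segment("lei non è stato a casa .".split(), known))
# ['lei non è stato', 'a casa .']
```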

Translation of the phrase p2 in (110) leads to the generation of the wrong English verb form. But when the subject pronoun is a part of the translation phrase, as in phrase p1 in (111), the correct translation is generated. Unfortunately, the Italian pronoun lei occurs only with the masculine participle of the verb essere (= be) in the training data, so the needed phrase was not extracted.

In conclusion, we have shown that the improved VP alignment does not contribute to the improvement of translating the omitted Italian subject pronouns into English. The rule-based VP alignment does change the translation probabilities of the relevant translation pairs. Correct translation pairs were found which have higher probabilities in Mmod. Furthermore, it has been observed that the English subject pronouns are more frequently a part of the phrases extracted from the modified word alignment. Unfortunately, these changes do not have an impact on the generated translations. Incorrect subject pronouns in English are generated not because of erroneous word alignment, but because of the way subject pronouns are used. Frequently used subject pronouns (1st and 2nd person pronouns) are often correctly generated. They occur more often with the corresponding Italian inflected verb and can therefore be extracted as translation pairs with relatively high probabilities. 3rd person pronouns are relatively rare and lead to the extraction of the corresponding translation pairs with relatively small translation probabilities.³³ The English language model which I used was trained on a relatively small monolingual data set. A better language model could in some cases lead to the generation of correct obligatory English subject pronouns which were false in the examples shown in this section.

In the preceding discussion, I claimed that an Italian inflected verb should generate

an English subject pronoun and verb. This will, however, lead to erroneous translations if the Italian sentence has an NP subject. When, for example, the Italian verb hanno has to be translated, it must be possible to translate it both as the English verb have and as the English phrase they have (cf. table 14). Which translation is correct depends on the Italian input. If the input sentence does not have a subject (the pronominal subject is dropped), the English phrase they have should be generated. If the Italian input contains an NP subject, the Italian verb should be translated as the corresponding English verb (without the subject pronoun). Therefore, both translation phrase pairs are correct in an adequate context.

³³ Statistics on the occurrence of different subject pronouns in English are shown in appendix C.

Translation direction EN → IT

In the following, we examine the opposite translation direction and check whether the rule-based VP alignment contributes to the translation of the English subject pronoun and its VP into the correct Italian VP. As already discussed in section 3.2.2, when translating the English subject pronoun into Italian, it has to be decided whether the Italian pronoun should be expressed overtly or omitted. In SMT, this decision is made implicitly by using the translation probabilities of English phrases in combination with the Italian language model.

When examining test sentences, I noticed that 3rd person singular pronouns are often generated whereas the others are more often omitted. Again, I presume that this is due to the usage of the pronouns. English 1st person pronouns are very frequent and are very likely to occur with different Italian VPs with omitted subject. Therefore, they are very likely to be extracted as parallel phrases in which the Italian phrase does not have a pronoun. This is confirmed by the phrase translation tables 16 and 17.

i can →   posso    io posso   phrase count   phrase pair type
M1        0.5712   0.0024     2,902          594
Mmod      0.6341   0.0034     2,963          372

Table 16: Translation probabilities of i can into (io) posso

In M1, the phrase pair (I can, posso) was extracted 1650 times whereas in Mmod it was extracted 1879 times. The difference in the counts of the phrase pair (I can, io posso) is relatively small: in M1, it occurs 7 times, and in Mmod 10 times. When the English and the Italian sentence both have pronominal subjects, it is very likely that the pronouns are aligned with each other. However, the English pronoun may also be aligned with additional Italian words, which would have an impact on the extraction of translation phrases. The VP alignment rules prohibit these additional alignments, which could explain the higher count of (I can, io posso) in Mmod than in M1. The same observation applies to the phrase pairs shown in table 17.

we know →   sappiamo   noi sappiamo   phrase count   phrase pair type
M1          0.6125     0.0157         2,145          372
Mmod        0.5358     0.0139         2,736          551

Table 17: Translation probabilities of we know into (noi) sappiamo
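The effect of an additional alignment link on phrase extraction can be illustrated with the standard consistency criterion of phrase extraction: a phrase pair may only be extracted if no alignment link leads from inside the pair to a word outside of it. The sketch below is a simplified version of that check. The sentence pair and the extra link are invented for illustration and are not taken from the corpus.

```python
def is_consistent(links, en_span, it_span):
    """Check whether a phrase pair is consistent with a word alignment.

    links: set of (en_index, it_index) pairs; spans are inclusive (start, end).
    """
    en_lo, en_hi = en_span
    it_lo, it_hi = it_span
    touched = False
    for e, i in links:
        e_in = en_lo <= e <= en_hi
        i_in = it_lo <= i <= it_hi
        if e_in != i_in:       # a link crosses the phrase-pair boundary
            return False
        touched = touched or e_in
    return touched              # require at least one link inside the pair

# EN: I(0) can(1) do(2) it(3)    IT: io(0) posso(1) farlo(2)   (invented example)
clean = {(0, 0), (1, 1), (2, 2), (3, 2)}
noisy = clean | {(0, 2)}        # hypothetical extra link: I - farlo
print(is_consistent(clean, (0, 1), (0, 1)))  # True:  (I can, io posso) can be extracted
print(is_consistent(noisy, (0, 1), (0, 1)))  # False: the extra link blocks extraction
```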


The five most probable translations for we know are shown in table 18.

we know →   M1                     Mmod
            0.6125 sappiamo        0.5358 sappiamo
            0.0181 è noto          0.0259 conosciamo
            0.0176 sappiamo bene   0.0186 è
            0.0171 conosciamo      0.0179 sappiamo che
            0.0167 si sa           0.0157 sappiamo bene

Table 18: Top five translation phrases for we know

I examined whether there is a difference in the number of Italian phrases aligned with the English phrase we know which contain verbs equivalent to the English verb know, namely sappiamo and conosciamo. Whereas these verbs are found in 34% of the phrases in M1, in Mmod they are a part of 38% of the translation phrases. The difference is due to the VP alignment rules, which allow English VPs to be aligned exclusively with Italian VPs (which in this work contain only verbal elements and negation). This is also the reason why the Italian phrase è noto (VER:fin + ADJ) has a smaller probability in Mmod whereas the finite verb form è is more probable than in M1.

The third person pronouns are not very frequent and occur with a relatively small number of verbs. In the process of translation, if the English subject pronoun and VP are not in the phrase table as a translation unit, they are split, resulting in two separate translation phrases: a phrase with the subject pronoun and a phrase with its VP. This is shown by the sentence in (112a) and its segmentation in (113). The generated translation is shown in (112b).

(112) a. He has spoken with his father.

      b. Egli  ha   parlato  con   il   suo  padre.
         he    has  spoken   with  the  his  father.

(113) [He]p1 [has spoken with]p2 [his]p3 [father]p4 [.]p5

It is very likely that he will be translated into the corresponding Italian pronoun (cf. table 19). The phrase has spoken with generates the correct Italian VP. The result is a sentence with an explicit subject pronoun. When the sentence is isolated, this is acceptable, but in a larger text, if a large number of Italian subject pronouns are generated where null subjects could be used, the translation would sound unnatural.

he →   egli     lui      ha       phrase count   phrase pair type
M1     0.1634   0.03     0.1146   5,981          1616
Mmod   0.4719   0.0861   0.0168   2,740          505

Table 19: Translation probabilities of he into egli, lui and ha

Table 19 also shows the impact of the rule-based VP alignment on the phrase translation table. If the Italian pronoun is available, the English pronoun is aligned only with it. If this is not the case, it is aligned with the Italian verb form. In M1, the phrase pair (he, egli) was extracted 967 times, whereas in Mmod it was extracted 1293 times. This leads to a very high probability of translating the English pronoun into the Italian pronoun, which is correct, but it leads to too few occurrences of the null subject in Italian.

he has →           M1               Mmod
                   0.2639 ha        0.3391 ha
                   0.0926 che ha    0.0739 che ha
                   0.0847 egli ha   0.0716 egli ha
                   0.0236 è         0.0414 è
                   0.0197 abbia     0.0235 abbia

phrase count       1514             1787
phrase pair type   456              402

Table 20: Top five translation phrases for he has

However, if the segmentation of the English sentence had included the phrase he has, it would have been more probable that the generated Italian sentence does not have a subject pronoun (cf. table 20).

Splitting the English subject pronoun from its VP could also lead to the generation of false Italian inflection since English verbs have poor morphology. Given, for example, the translation phrase desired without the subject pronoun, it is very likely that the wrong Italian verb form is generated if the language model does not penalise the erroneous Italian word sequence. Although I expected to see such errors, I was not able to find them in the tested sentences.

I also noticed that the 2nd person pronoun you is often translated as lei, meaning you in the polite form of address. An example of such a case is shown in the sentence in (114a), whose translation is shown in (114b).

(114) a. I can understand that you are annoyed.

      b. Capisco     che   lei  è    arrabbiato.
         understand  that  you  are  annoyed

The English phrase you are corresponds to three possible Italian phrases. You can correspond to the pronouns for the 2nd person singular tu and plural voi, and to the 3rd person singular lei (polite form of address). The SMT systems cannot resolve this ambiguity and choose the most probable phrase, in this case the phrase for the polite form of address: lei è. Without any context, however, a human translator would also have problems deciding which Italian VP is the correct translation of the English one. If it is clear from the context that the 2nd person singular is meant, the generated translation would be wrong. Certainly, this way of translating you are is caused by the corpus that has been worked with. But even if we had more evidence for translating you are into other possible Italian constructions, the ambiguity would still be a problem.

In summary, if the English subject pronoun and its VP are not included in the translation table as a translation unit, they are split, resulting in the generation of an Italian subject pronoun which could (or should) be omitted. Also, at least theoretically, English translation phrases consisting of verbs without the subject pronoun could lead to the generation of Italian VPs with false inflection, since one English verb form often corresponds to a number of Italian verb forms.

6.4 Adequate training data

After the discussion in the previous sections, the infrequent use of pronouns seems to pose the greatest problem for SMT in translating pronominal subjects. The question this raises is: if we had a corpus containing a large number of sentences with pronominal subjects occurring with many different verbs, would this solve the problems presented previously?

We would certainly have phrase pairs EN: prp_i + vp_j → IT: vp_k with high translation probabilities, which could improve translation results when translating the English subject pronoun prp_i with the VP vp_j into the Italian null subject and the correct VP vp_k. If the English pronominal subject is not split into a separate phrase from its VP, the overgeneration of Italian subject pronouns could be avoided.

But if we would like to translate the Italian null pronoun into the correct English pronoun and VP, this would lead to another problem. Suppose that an Italian VP with an NP subject should be translated, and the Italian VP is a translation unit. If it has a high probability of being translated as an English pronoun and a VP, we would incorrectly have two subjects: a translation of the Italian NP subject and the English pronoun generated from the Italian VP. Since both translations have to be possible, it is important that both translation alternatives have comparable probabilities: IT: vp_i → EN: prp_j + vp_k and IT: vp_i → EN: vp_k. The Italian VP vp_i must have both a probable English translation phrase consisting of the pronoun prp_j and the VP vp_k, and a probable phrase consisting only of the VP vp_k. To make sure that an additional subject pronoun in English is not generated, it would be necessary to determine the subject of the Italian sentence. Given information about the subject, the correct translation phrase pair could be favored over the other.

Problems regarding ambiguities of verb inflection would, however, still exist. To resolve them, information from the context outside of the phrase pairs is needed. This lack of a model of context is a known flaw of phrase-based statistical machine translation which has only recently been addressed in a preliminary fashion in the literature.
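The idea of favoring one translation alternative once the subject of the Italian clause is known can be sketched as a filter over the candidate phrases. This is a hypothetical illustration, not a component of the systems built in this work: the function names are invented, the subject information is assumed to be given, and the probabilities are the M1 values for hanno from table 14.

```python
ENGLISH_SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def starts_with_subject_pronoun(phrase: str) -> bool:
    return phrase.split()[0].lower() in ENGLISH_SUBJECT_PRONOUNS

def choose_translation(candidates, has_overt_subject):
    """Pick the most probable candidate compatible with the source subject.

    candidates: dict mapping an English phrase to its translation probability.
    has_overt_subject: True if the Italian clause has an overt subject.
    """
    # With an overt Italian subject, the verb must not add a pronoun;
    # with a null subject, the English phrase has to supply one.
    allowed = {
        en: p for en, p in candidates.items()
        if starts_with_subject_pronoun(en) != has_overt_subject
    }
    return max(allowed, key=allowed.get) if allowed else None

hanno_m1 = {"have": 0.5761, "they have": 0.0264}   # table 14, system M1
print(choose_translation(hanno_m1, has_overt_subject=True))    # have
print(choose_translation(hanno_m1, has_overt_subject=False))   # they have
```

With such a filter, the two alternatives would no longer need comparable probabilities, but the approach depends entirely on a reliable analysis of the Italian subject.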


7 Conclusion

In this work, a detailed analysis of the problem of translating pronominal subjects within statistical machine translation is carried out. Italian, a null subject language (NSL), and English, a non-null subject language (non-NSL), were used. A rule-based method for aligning English and Italian VPs with pronominal subjects is presented. The rule-based VP alignment was used to build phrase-based SMT systems in order to examine whether the more accurate word alignment of VPs would lead to an improvement of the pronominal subject translation. Unfortunately, this was not the case. The usage of subject pronouns and the corpus characteristics have a significant influence on extracting the correct translation pairs. Phrase-based SMT is not adequate for pronoun translation and generation since it does not have any information about the context outside the translation phrases.

The main findings of the work are summarized in section 7.1. Future effort in improving translation of (null) subject pronouns is outlined in section 7.2.

7.1 Summary

In some languages like Italian, overt subject pronouns are not obligatory. The verbal morphology is rich enough to reveal characteristics like person and number of the missing pronoun (cf. section 2.1). Italian subject pronouns are used when they fulfill specific functions like emphasis, reintroducing referents, etc. (cf. section 2.3). On the other hand, some languages like English rarely allow the omission of subject pronouns. English syntax generally requires that the subject position is occupied; otherwise, the sentence is not grammatically correct.

The optional use of the subject pronoun in Italian and the obligatory use of the subject pronoun in English lead to problems in word alignment of parallel sentences with pronominal subjects, as well as in statistical machine translation. Until now, the problem of translating (null) subjects between an NSL and a non-NSL has been dealt with only indirectly. An overview of previous work was given in section 3.1. The analysis of different translation cases showed that in many cases, Italian inflected verbs can provide the information needed to generate the correct English subject pronoun. Problems arise when the verbs are ambiguous with respect to person, number and/or gender. For example, an Italian finite verb in the 3rd person singular does not carry information about the gender of the missing subject. This can lead to the generation of the wrong English pronoun. A further problem is the gender discrepancy between the languages. For example, whereas animals are grammatically neuter in English, in Italian they can be both feminine and masculine. Various translation cases and problems are discussed in section 3.2. In many cases, the examination of the context (previous sentence(s)) is required to derive all information that would ensure the generation of the correct English pronoun. Most (statistical) machine translation systems do not use the context, but translate sentences as isolated translation units.

As already mentioned, the absence of Italian subject pronouns causes problems in the

word alignment task. Suppose that an Italian and an English sentence pair containing pronominal subjects has to be word aligned automatically. It is very likely that the English subject pronoun does not have a direct Italian counterpart since Italian allows for subject pronoun omission (cf. table 2, section 2.4). For this reason, English subject pronouns are often aligned with Italian object clitics, conjunctions, etc. I developed alignment rules which define the word alignment of English subject pronouns. English subject pronouns have to be aligned with Italian words carrying the same linguistic information (person, number, gender). If the Italian subject pronoun is expressed overtly, the English subject pronoun is aligned with it. If the subject is dropped, the English subject pronoun is aligned with the Italian finite verb form. In addition to the rules for the alignment of English subject pronouns, I developed rules for the alignment of VPs (verbal elements of a VP and negation). The rules are based on the category of the VP elements (finite verb, auxiliary, participle, etc.). I used English parse trees enriched with functional tags (cf. section 5.2.1) and part of speech tagged Italian sentences (cf. section 5.2.2). The process of aligning parallel phrases consists of several steps. An Italian sentence is searched in order to find all Italian VPs (cf. section 5.3.1). In the parallel English sentence, the clauses with pronominal subjects are detected. The baseline word alignment of the elements of an English VP (created by GIZA++) is used to identify the matching Italian VP (cf. section 5.3.2). The alignment rules compute the alignment of the phrase pair elements by searching for specific PoS pairs in a specific PoS sequence. A detailed description of the 15 alignment rules for Italian and English VPs is presented in section 5.4.

The rules were applied on a test set containing 200 parallel sentences. The evaluation

results (precision, recall, f-score) indicate that the VP alignment computed by the rules is better than the baseline alignment computed by GIZA++ (cf. table 6, section 5.5.1). Expressed in f-score, the rule-based VP alignment exhibits an improvement of 17 percentage points (f-score = 81%). Precision of the baseline VP alignment is 66% whereas the precision of the rule-based VP alignment is 80%. Recall of the baseline alignment is 61% whereas the recall of the rule-based VP alignment is 81%.

False alignments are computed if false parallel VPs are identified. Not every English

VP has a parallel Italian VP. Due to free translation, English VPs can correspond to Italian PPs or participles, or they are simply not translated. These cases cause problems for the rule-based VP alignment. The process of identifying the matching Italian VP for an English VP does not always find the correct Italian VP. Since the VP alignment rules take only the PoS of the phrase elements into account, they compute a false word alignment in these cases. Furthermore, the implementation of the rules is insufficient as it does not place any constraints on the number of links that can be computed for a VP element. In some cases, this leads to additional alignments which are erroneous. For example, the VPs can be extended, containing participles or infinitives that do not correspond to any element of the parallel phrase. Such phrases can lead to an alignment between, for example, one English main verb and two Italian main verbs. The program does not verify which alignment is more probable (i.e., lexical parallelism of the aligned words) and should therefore exist in the resulting alignment. Instead, all possible alignments are included in the computed VP alignment. The errors made by the rule-based VP alignment are discussed in section 5.5.2.

I built four SMT systems to examine whether the improved VP alignment leads to


the improvement of the pronominal subject translation between English and Italian. For each translation direction, two systems have been built: (i) a phrase-based SMT system using the baseline word alignment (M1), and (ii) a phrase-based SMT system using the baseline word alignment combined with the rule-based VP alignment (Mmod). In the translation direction EN → IT, M1 has a BLEU score of 19.15 whereas the BLEU score of Mmod is 18.18. In the opposite translation direction, the BLEU score of M1 is 22.07 whereas the BLEU score of Mmod is 21.81 (cf. table 9, section 6.2). The BLEU scores are slightly worse for the Mmod systems. Manual examination of the generated sentences, however, revealed that all systems produce nearly identical output, leading to the conclusion that the rule-based VP alignment does not have any impact on the (null) subject translation between English and Italian.

However, the rule-based word alignment does change the translation parameters. The

number of the phrases (VPs) in which the English phrase contains the subject pronoun whereas the Italian VP has only the inflected verb form is greater in Mmod than in M1 (cf. section 6.3). In some cases, the translation probability of the correct translation pair is higher in Mmod than in M1 (cf. table 16, section 6.3). These observations lead to two important conclusions: (i) when translating Italian into English, Mmod is more likely to generate the English subject pronoun; (ii) the probability of generating the correct Italian inflected verb is higher in Mmod than in M1. Despite the fact that there are differences in translation probabilities for the relevant translation phrases indicating that Mmod should generate better translations, an improvement in translation output was not observed.

This can be explained by the fact that the translation probabilities of the phrases consisting of a subject pronoun with a VP are relatively small. The verbs in such phrases do not only occur with pronominal subjects, but also with NP subjects. In such contexts, the verb (or VP) pairs are extracted without a subject pronoun. Their likelihood is high since they occur often and with a large number of different NP subjects. When translating inflected Italian verbs into English, it is therefore very likely that the verb is translated into the corresponding English verb. If the Italian verb has an NP subject, this translation is correct. But if the Italian subject is dropped, an English sentence is generated that does not have a subject.

I also noticed that some pronouns are more often correctly translated than others.

This is due to the relatively infrequent use of subject pronouns and the characteristics of the corpus that I have been working with. 1st and 2nd person pronouns are used more frequently than 3rd person pronouns. Observation of the generated sentences showed that 1st person pronouns are correctly translated in most cases. 2nd person pronouns are problematic because of the ambiguity of Italian verbs and the characteristics of the corpus (cf. examples (108) and (109), section 6.3 and table 3, section 2.4). 3rd person pronouns cause the most problems because the verbs they occur with can also have NP subjects, as already mentioned above. When translating English into Italian, it is very likely that the English phrase containing the subject pronoun (for example, a 3rd person pronoun) and the VP is not included in the translation table. The pronoun is then translated separately from the VP, leading to the generation of the Italian subject pronoun. If this occurs in many subsequent sentences, one is faced with overgeneration of pronouns in the Italian output.

The problem regarding the small translation probabilities of the phrases consisting of a pronominal subject and a VP cannot be solved by better (or perfect) word alignment of the VPs with pronominal subjects. In fact, a parallel corpus is needed in which the pronouns occur much more frequently with a large number of different verbs. Within an SMT system, this would increase their translation probabilities automatically. However, when translating Italian into English, a syntactic analysis of the Italian input is needed to determine whether the sentence has a pronominal or an NP subject. Given this information, the correct translation phrase can be chosen. The linguistic characteristics (person, number, gender) of Italian pronominal subjects can be determined if the Italian (null) subject is resolved, which requires access to previous sentences. In the opposite translation direction, it has to be decided whether the Italian subject pronouns have to be dropped or expressed overtly. I noticed that some adjectives (for example, tutti (= all)) trigger the use of overt Italian subject pronouns (cf. examples (24) - (27), section 2.4). However, since the use of the Italian subject pronouns has pragmatic reasons (cf. section 2.3), it is not trivial for a (statistical) machine translation system to decide whether the subject pronoun should be realized overtly or be dropped.
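The alignment rule for English subject pronouns summarized in this section can be sketched as follows. This is a simplified illustration and not the implementation used in the thesis: the function name is hypothetical, the agreement check on person, number and gender is omitted, and finite forms with clitics are ignored. The PoS tags are those of the Italian tag set in appendix A.

```python
FINITE_TAGS = {"AUX:fin", "VER2:fin", "VER:fin"}

def align_subject_pronoun(en_index, it_tags):
    """Return the alignment link (en_index, it_index) for an English subject pronoun.

    it_tags: PoS tags of the matching Italian VP, including its subject if overt.
    """
    # Rule 1: an overt Italian personal pronoun takes the link.
    for i, tag in enumerate(it_tags):
        if tag == "PRO:pers":
            return (en_index, i)
    # Rule 2: with a null subject, link to the first finite verb form.
    for i, tag in enumerate(it_tags):
        if tag in FINITE_TAGS:
            return (en_index, i)
    return None  # no suitable Italian word found

# "we know" / "noi sappiamo": overt pronoun available
print(align_subject_pronoun(0, ["PRO:pers", "VER:fin"]))          # (0, 0)
# "you were not" / "non è stata": subject dropped, link to the auxiliary
print(align_subject_pronoun(0, ["NEG", "AUX:fin", "VER:ppast"]))  # (0, 1)
```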

7.2 Future work

In my thesis, I showed a method for aligning English VPs with pronominal subjects with parallel Italian VPs. Improved alignment of English subject pronouns with Italian inflected verbs did not result in an improvement of pronominal subject translation between English and Italian. In the following, I outline further possible methods to improve the alignment of relevant phrases and the translation of pronominal subjects between the null subject language Italian and the non-null subject language English.

Word alignment of English and Italian VPs

The method for the VP alignment that I presented in this thesis is based on the assumption that every English VP with a pronominal subject has a parallel Italian VP. This assumption does not always hold since the translations are not always literal. Some English VPs do not have an Italian counterpart or they correspond to an Italian phrase of an arbitrary type.

The rule-based method for the VP alignment could be extended in order to handle these cases. The method for the identification of a parallel Italian phrase should allow Italian phrases like PPs to be identified as parallel phrases of English VPs. The translation probabilities of English and Italian PoS sequences could be used to derive parallel phrases.

The rules for the VP alignment handle only verbal elements of a VP, negation and subject pronouns. In the case of a syntactic divergence in which the words of a phrase pair do not have a matching PoS, they remain unaligned. In some cases, this leads to the removal of correct alignment links. The deletion of such links could be avoided if their reliability were computed (for example, by using lexical translation probabilities and the alignment of the neighbouring words).

If we assume that syntactic phrases of different types in English and Italian correspond to each other, we need parse trees of English and Italian sentences in order to correctly identify the parallel phrases.

In this work, the VP alignment rules have been applied only to VPs with a pronominal subject. The rules could also be used to align all VPs regardless of the type of subject (pronominal or NP subject).

Translation direction IT → EN

Italian sentences often do not have overtly expressed subjects. Their characteristics like person, number and gender can be derived from the inflected verb and the preceding context (sentences). Statistical machine translation systems do not have access to the preceding context of the sentence that is to be translated. The translation phrases which contain Italian inflected verbs and the English language model should therefore lead to the generation of correct English subject pronouns. Correct phrase pairs could be learned from a corpus which contains many sentences with pronominal subjects (cf. section 6.4). But if the correct translation phrases had high translation probabilities, we would run into problems when the Italian source sentence contains an NP subject. If the inflected verb generates an English pronoun, the English translation could have two subjects, which would be incorrect. The information about the subject in the source sentence could be used to choose the correct translation phrase pair.

Another approach to the problem of generating English pronominal subjects is the incorporation of pronoun resolution into the translation process. If the referent of an omitted Italian subject pronoun is determined, all characteristics (number, gender, etc.) of the missing pronoun could be derived and used to generate the corresponding English pronoun.

Translation direction EN → IT

When translating English subject pronouns into Italian, it is important that the correct inflected verb is generated. Furthermore, a decision has to be made whether the subject pronoun should be generated or omitted. The use of an appropriate corpus as training data (cf. section 6.4) could lead to an improvement of the translation of English pronouns into Italian (null) pronouns. A corpus containing many sentences with different pronominal subjects would lead to the extraction of many different English phrases consisting of a subject pronoun and verbs together with their Italian counterparts ((null subject) + inflected verb). This would, however, not solve the problems which concern the gender discrepancy between English and Italian. To ensure the generation of a correct Italian subject pronoun, it would be necessary to resolve the co-reference of the English pronominal subject. [Le Nagard & Koehn, 10] show a method for the integration of co-reference resolution into phrase-based statistical machine translation.

In some cases, Italian subject pronouns are expressed overtly. Statistical models could learn such contexts (word or PoS sequences) in order to predict how the Italian subject pronoun should be realized.


A Italian tag set

ADJ adjective

ADV adverb (excluding -mente forms)

ADV:mente adverb ending in -mente

ART article

ARTPRE preposition + article

AUX:fin finite form of auxiliary

AUX:fin:cli finite form of auxiliary with clitic

AUX:geru gerundive form of auxiliary

AUX:geru:cli gerundive form of auxiliary with clitic

AUX:infi infinitival form of auxiliary

AUX:infi:cli infinitival form of auxiliary with clitic

AUX:ppast past participle of auxiliary

AUX:ppre present participle of auxiliary

CHE che

CLI clitic

CON conjunction

DET:demo demonstrative determiner

DET:indef indefinite determiner

DET:num numeral determiner

DET:poss possessive determiner

DET:wh wh determiner

NEG negation

NOCAT non-linguistic element

NOUN noun

NPR proper noun

NUM number

PRE preposition

PRO:demo demonstrative pronoun

PRO:indef indefinite pronoun

PRO:num numeral pronoun

PRO:pers personal pronoun

PRO:poss possessive pronoun

PUN non-sentence-final punctuation mark

SENT sentence-final punctuation mark

VER2:fin finite form of modal/causal verb

VER2:fin:cli finite form of modal/causal verb with clitic

VER2:geru gerundive form of modal/causal verb

VER2:geru:cli gerundive form of modal/causal verb with clitic

VER2:infi infinitival form of modal/causal verb

VER2:infi:cli infinitival form of modal/causal verb with clitic

VER2:ppast past participle of modal/causal verb

VER2:ppre present participle of modal/causal verb


VER:fin finite form of verb

VER:fin:cli finite form of verb with clitic

VER:geru gerundive form of verb

VER:geru:cli gerundive form of verb with clitic

VER:infi infinitival form of verb

VER:infi:cli infinitival form of verb with clitic

VER:ppast past participle of verb

VER:ppast:cli past participle of verb with clitic

VER:ppre present participle of verb

WH wh word


B English tag set (Penn Treebank Tagset)

CC Coordinating conjunction

CD Cardinal number

DT Determiner

EX Existential there

FW Foreign word

IN Preposition or subordinating conjunction

JJ Adjective

JJR Adjective, comparative

JJS Adjective, superlative

LS List item marker

MD Modal

NN Noun, singular or mass

NNS Noun, plural

NNP Proper noun, singular

NNPS Proper noun, plural

PDT Predeterminer

POS Possessive ending

PRP Personal pronoun

PRP$ Possessive pronoun

RB Adverb

RBR Adverb, comparative

RBS Adverb, superlative

RP Particle

SYM Symbol

TO to

UH Interjection

VB Verb, base form

VBD Verb, past tense

VBG Verb, gerund or present participle

VBN Verb, past participle

VBP Verb, non-3rd person singular present

VBZ Verb, 3rd person singular present

WDT Wh-determiner

WP Wh-pronoun

WP$ Possessive wh-pronoun

WRB Wh-adverb

C English subject pronoun occurrences

In the process of computing the VP alignment, clauses in the English part of the parallel corpus (cf. chapter 5.2) are identified and checked for a subject pronoun. I counted the occurrences of each subject pronoun as well as the clauses in which the subject is not pronominal. The counting results are shown in table 21. The entire corpus consists of 749,646 sentences, which can be divided into 1,254,086 clauses.³⁴

I      we     you    he     she    it     they   NP
14%    15%    2%     0.8%   0.2%   9%     0.2%   54%

Table 21: Pronoun occurrence in English

More than half of the corpus clauses (54%) have NP subjects. In the context of dealing with subject pronouns, these clauses (and their verbs) cannot be used to extract English verbs and pronouns together with their correspondences in Italian. Instead, they contribute to the probabilities of phrases consisting only of verbs without a subject pronoun.
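The counting step described in this appendix can be sketched as follows. This is a minimal illustration in Python, not the program used for the thesis: the function name subject_distribution, the input format (one subject label per clause, with None when no subject was recognised) and the category name "unrecognised" are assumptions made for the example.

```python
from collections import Counter

# English subject pronouns counted in table 21.
SUBJECT_PRONOUNS = {"i", "we", "you", "he", "she", "it", "they"}


def subject_distribution(clause_subjects):
    """Return the share (in percent) of each subject category.

    clause_subjects holds one entry per clause: a subject pronoun,
    any other string for a non-pronominal (NP) subject, or None when
    no subject could be recognised (hypothetical input format).
    """
    if not clause_subjects:
        return {}

    counts = Counter()
    for subject in clause_subjects:
        if subject is None:
            counts["unrecognised"] += 1
        elif subject.lower() in SUBJECT_PRONOUNS:
            # Normalise case so that "We" and "we" are counted together.
            key = "I" if subject.lower() == "i" else subject.lower()
            counts[key] += 1
        else:
            counts["NP"] += 1

    total = len(clause_subjects)
    return {category: 100 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    example = ["I", "we", "We", "the house", None, "it", "NP", "I"]
    for category, share in subject_distribution(example).items():
        print(f"{category}: {share:.1f}%")
```

Keeping a separate category for clauses without a recognised subject makes the shares of the remaining categories add up to less than 100%, which mirrors the gap of roughly 5% in table 21.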

³⁴ The missing 5% are due to incorrect recognition of subjects.

List of Tables

1 Statistics on referents of 3rd person subjects in Italian
2 Occurrence of SUBJ in Italian
3 Occurrence of null-SUBJ in 93 observed clauses
4 Evaluation of GIZA++ word alignment for English and Italian
5 Example phrase translation probabilities for io sono
6 Evaluation of the VP alignment
7 Evaluation of VP alignment for different IT-VP identification approaches
8 Evaluation of different VP alignments
9 BLEU scores of the SMT systems for EN ↔ IT
10 Translation probabilities for so into (I) know
11 Top five translation phrases for so
12 Translation probabilities of che parli into that (you/she) speak(s)
13 Translation probabilities of parli into that (you) speak/talk
14 Translation probabilities of hanno into (they) have
15 Translation probabilities of pensano into (they) think
16 Translation probabilities of i can into (io) posso
17 Translation probabilities of we know into (noi) sappiamo
18 Top five translation phrases for we know
19 Translation probabilities of he into egli, lui and ha
20 Top five translation phrases for he has
21 Pronoun occurrence in English

List of Figures

1 Main program: correct_align
2 System components
3 Alignment check and improvement
4 Alignment of I would ask you to request and la prego di chiedere
5 Search for the best Italian VP
6 Incorrect base alignment of if you wish and se lo desidera
7 Correct alignment of if you wish and se lo desidera
8 Alignment of I can tell you and posso risponderle
9 Alignment of it actually passes and esso stesso approva
10 Alignment of I would say and volendo dire
11 Alignment of I have said and aver detto
12 Incorrect base alignment of I feel and ritengo
13 Correct base alignment of I feel and ritengo
14 Alignment of you enjoyed and abbiate trascorso
15 Alignment of you have requested and avete chiesto
16 Alignment of we were elected and sono stati eletti
17 Complete alignment of they had and di avere
18 Alignment of you have requested and avete chiesto
19 Alignment of you have requested and chiedevate
20 Alignment of I would like to say and vorrei dire
21 Alignment of we do not adhere and noi non rispettiamo
22 Alignment of I suggest to present and raccomando di presentare
23 Alignment of I shall do and seguirò
24 Alignment of we have upheld and abbiamo sostenuto
25 Alignment of you have suggested and lei propone (= you proposed)
26 Alignment of he is to go and verrà messo
27 Alignment of you hear and ascoltando
28 Alignment comparison: I accept and lo accetto
29 Alignment comparison: it will (, I hope,) be examined and sarà esaminata
30 Alignment comparison: I can (,therefore,) give and pertanto può contare su
31 Alignment comparison: we (then) proceed and poi di procedere
32 Alignment comparison: I have (thus) proposed and , ho proposto
33 Alignment comparison: they do not (properly) reflect and esso non rifletterà
34 Alignment comparison: I might be allowed to give and mi permettesse di rilasciare
35 Alignment of we agree (...) support and condividiamo (...) appoggiamo
36 Alignment of it may have been and sia stato
37 Alignment of they have been answered and avessero ottenuto (risposta)
38 Alignment of you were (unable to attend) and lei non ha potuto partecipare
39 Alignment of you were unable to attend and lei non ha potuto partecipare
40 Alignment of I come and provengo
41 Alignment of you know and assume
