TOWARDS A DISCOURSE RESOURCE FOR ITALIAN

UNIVERSITÀ DEGLI STUDI DI PAVIA

Faculty of Arts and Philosophy

Master's Degree Course (Laurea Specialistica) in Theoretical and Applied Linguistics

TOWARDS A DISCOURSE RESOURCE FOR ITALIAN: DEVELOPING AN ANNOTATION SCHEMA FOR ATTRIBUTION

Supervisor: Prof. Irina Prodanof

Co-supervisor: Dr Claudia Soria

Co-supervisor: Prof. Cecilia Maria Andorno

Thesis by: Silvia Pareti

Academic Year 2008/2009

Nobody believes the official spokesman...

but everybody trusts an unidentified source.

Ron Nessen

Abstract

This thesis investigates the complex phenomenon of attribution and addresses the

issue of annotating attribution relations, developing, by means of a pilot study, a

possible annotation schema to be applied to the Italian Syntactic Semantic

Treebank corpus of newspaper articles (ISST).

Attribution is the relation holding between assertions, but also e.g. beliefs,

feelings and intentions, and the agents they belong to (e.g. The minister says that

taxes will rise in 2010). As this relation deeply affects the way we perceive

information, information should not be considered in isolation from its source. It is fundamental to

recognise attribution in order to deal with the reliability of information and with

opinions.

The development of an annotation schema for attribution aims at providing

a resource in which information is overtly linked to its source. Having this

annotated resource could serve a number of purposes especially in the fields of

Information Retrieval, Multi Perspective Question Answering and Opinion Mining.

To date, attribution has only been annotated when associated with a discourse

connective or one of its arguments (Prasad et al., 2007), or only at the document,

sentence (Skadhauge and Hardt, 2005) or even word level (Wiebe, 2002) thus

only partially approaching the phenomenon.

The present study addresses attributions independently, while also regarding them

as a discourse phenomenon. After analysing the features and issues

connected to attribution, e.g. scope definition, nested attributions, factuality of the

relation and co-reference resolution, an annotation schema will be proposed

following the identification of a set of possible ‘attribution devices’. To test its

feasibility and accuracy against data, a pilot annotation will be performed on a

portion of the ISST corpus. This will allow the definition of annotation guidelines

and the identification of additional issues that remained unnoticed at the theoretical

level. In order to select a suitable tool to perform the pilot annotation, several

available annotation tools will be compared.

This thesis not only constructively contributes to the development of a

discourse resource for Italian, but also approaches attribution relations from a new,

independent perspective, raising problematic issues and providing a deeper

account of the phenomenon. Further developments of the project should perform a

complete pilot annotation of all the types of attribution and features intended to be

included and develop, together with the appropriate tool, a final annotation schema

to be applied to the whole corpus.

Acknowledgments

It might seem banal, but every time a challenging project is over it is useful to

look back and consider who made it possible. Not only in order to recognise

other people’s merits and efforts, but especially to realise that we have not been

alone. It is because of that very feeling that every time an endeavour finally comes

to its conclusion, I can once again think about starting a new one. I can start with

remembering how much of what I am I owe to the Erasmus Scheme and the

chances it gave me first as a student and recently as an intern to study and

research at two amazing UK universities: Reading University and the University of

Edinburgh. First of all I would like to thank my supervisor at Edinburgh, Bonnie

Webber, for the unforgettable opportunity and the many hours she devoted to

listening to my progress and my many doubts, every time with a solution to

propose or the name of someone who might have one. Echoes of the enlightening

conversations I had there with Theresa Wilson, Janyce Wiebe, Jean Carletta,

Nicoletta Calzolari, John Niekrasz, Katja Markert, and other colleagues, can be

found in this thesis as they were fundamental in shaping my choices and widening

my perspective concerning the topic. A special acknowledgement is also due to

Rashmi Prasad who has patiently answered all my e-mails providing me with

material and clarifications about the PDTB and precious suggestions. Constructive

were also the contacts I had with Roser Saurí, Tommaso Caselli and Massimo

Poesio. Lastly, I cannot forget the contribution of Jasmine to the revision of the

thesis and the technical and loving support Gregor unfailingly provided.

Contents

Abstract
Acknowledgments
List of Figures and Tables
List of Figures
List of Tables

1 Introduction
1.1 An Independent Approach to Attribution
1.2 Methodology
1.3 Terminology
1.4 Outline of the Thesis

2 Discourse and Attribution
2.1 What is Discourse?
2.1.1 Definition
2.1.2 Theories of Discourse Coherence and Cohesion
2.1.3 Constituency vs. Dependency
2.2 Discourse Annotation Projects
2.2.1 RST-DT
2.2.2 The Penn Discourse TreeBank PDTB
2.2.3 Other Projects
2.3 Attribution
2.3.1 Towards a Definition of Attribution
2.3.2 Are Attribution Relations a Discourse Phenomenon?
2.4 Related Studies
2.4.1 GraphBank
2.4.2 Opinion Corpus
2.4.3 PDTB - The Penn Discourse TreeBank
2.5 Summary

3 An Analysis of Attribution
3.1 The Components of Attribution
3.1.1 The Source
3.1.2 The Content
3.1.3 Elements Functioning as Cue
3.2 Some Issues
3.2.1 Nested Attributions
3.2.2 Source of the Source
3.2.3 Multiple Sources, Contents, Cues
3.2.4 Co-reference Resolution
3.2.5 Scope Definition
3.3 Summary

4 Features to Include in the Annotation
4.1 Type
4.1.1 Assertion
4.1.2 Belief
4.1.3 Fact
4.1.4 Eventuality
4.1.5 Issues Concerning Type Definition
4.2 Source
4.2.1 Writer
4.2.2 Arbitrary
4.2.3 Other
4.3 Factuality
4.3.1 Factual
4.3.2 Non-factual
4.4 Scopal Change
4.4.1 Scopal Polarity
4.4.2 Other Elements Affecting the Factuality
4.5 Summary

5 Performing a Pilot Annotation
5.1 Corpus
5.1.1 ISST Architecture
5.1.2 Subcorpus Selection
5.2 Tool Selection
5.2.1 Requirements
5.2.2 Comparison of Available Tools
5.2.3 Selection and Tool Specifics
5.3 Setting MMAX2
5.3.1 Scheme
5.3.2 Customization
5.3.3 Style
5.4 Feasibility of the Schema and Issues
5.5 Summary

6 Annotation Schema and Guidelines
6.1 Text Spans Selection
6.1.1 Source Span
6.1.2 Cue Span
6.1.3 Content Span
6.1.4 Supplement
6.2 Feature Annotation Guidelines
6.2.1 Type Attribute
6.2.2 Factuality Attribute
6.2.3 Scopal Change Attribute
6.2.4 Source Type Attribute
6.3 Collecting a List of Italian Cues
6.3.1 Extracting Verb Cues from the PDTB
6.4 Summary

7 Conclusion
7.1 Future Work
7.1.1 And Beyond

Bibliography
Abbreviations and Acronyms
Appendix 1 – MMAX2 Code
Appendix 2 – Italian Attribution Cues
Appendix 3 – PDTB Verb Cues

List of Figures and Tables

List of Figures

Figure A - Reported news example
Figure B - RST schemas
Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)
Figure D - Sense classification of discourse connectives in the PDTB
Figure E - Graphic extra-linguistic attribution
Figure F - Newspaper article source
Figure G - Nested attribution schema
Figure H - Truth values of a nested content
Figure I - Design Process
Figure J - ISST orthographic level (sole002)
Figure K - ISST morpho-syntactic level (sole002)
Figure L - ISST syntactic constituent level (sole002)
Figure M - ISST table format
Figure N - GATE annotation environment
Figure O - GATE annotation exported in XML
Figure P - Knowtator annotation environment
Figure Q - Knowtator annotation exported in XML
Figure R - MMAX2 Project Wizard
Figure S - MMAX2 Base Data (ISST cs001)
Figure T - The annotation of cue, content and source as separate levels
Figure U - MMAX2 Annotation window
Figure V - MMAX2 Annotation window (attributes)
Figure W - MMAX2 Annotation of relations
Figure X - Nested attributions visible through handles
Figure Y - Attribution relation components
Figure Z - Annotation, text spans selection
Figure AA - Annotation, elements which could function as a markable
Figure BB - Annotation, attributes selection

List of Tables

Table 1 - Factuality values (Saurí and Pustejovsky, 2008)
Table 2 - N. of articles selected per section
Table 3 - Knowtator/MMAX2 feature comparison
Table 4 - Annotation schema features
Table 5 - Factuality and Scopal change values assignment

1 Introduction

Discourse relations represent a fundamental aspect of discourse understanding

and generation. Therefore research in many areas, such as Information Extraction,

Discourse Generation and Question Answering, would benefit from a discourse

annotated corpus as a basis for their studies.

The aim of this thesis is to contribute towards providing Italian with

complete linguistic resources, in particular by designing and testing the addition

of a discourse level of annotation to the ISST corpus, a multi-level annotated

corpus of Italian newspaper texts. The corpus already consists of five levels of annotation:

orthographic, morpho-syntactic, syntactic (constituents), syntactic (dependencies)

and semantic. The addition of a layer for discourse annotation comes as a natural

development of the ISST corpus.

Most of the work in this area, to date, concentrates on analysing and

annotating discourse connectives or anaphoric relations. For the purpose of the

present study, however, these issues will not be addressed and the focus will be

on attribution relations. This topic is especially relevant for research dealing with

Information Retrieval, Multi-Perspective Question Answering and Opinion Mining.

Tools able to discern information according to the relevance of its source or

to identify different opinions with regards to a given topic would dramatically

improve the quality of the information we are constantly exposed to. People

increasingly turn to the internet as a source of information and knowledge,

querying search engines instead of consulting encyclopaedias or experts. A number of

projects, most recently the Microsoft search engine ‘Bing’, are trying to outperform ‘Google’

and break its monopoly, with little success, as they introduce interesting small

changes without notably improving the reliability of the responses to our queries.

Search engines usually classify the source only at the macro-level, i.e. the

webpage a certain text or information was taken from.

The urge for retrieving answers quickly does not always allow users to take

the context in which the information was found into consideration or to address the

troublesome question: Where does this knowledge come from? Quite often, for

example, we hear people supporting their views by stating that they read

something about them on the internet or even that ‘internet says it’. This

generalisation is also due to the difficulty of linking the information to the exact

source, often hidden by several levels of attribution all nested one in another like a

Matryoshka doll.

The practice of reporting information is particularly pervasive in the

journalistic field and especially in news reviews where what is stated is always

second hand if it does not originate from even further away. In the example below

(Figure A), on the website First Bell it is reported that the UK has the largest

gender gap in science achievement. This, however, is according to the UK’s

Telegraph, which in turn reports a study from the OECD whose data is taken from

the Program for International Student Assessment (2006).

Figure A - Reported news example

http://links.mkt753.com/servlet/MailView?ms=NDEwMjIzOAS2&r=MTQ3NzY4ODQ3OAS2&j=MTIyNTgxMzg1S0&mt=1&rt=0

In the last few years the Web has become the undifferentiated repository of all human

knowledge. However, although it is surely the most superficial source of the data we

learn from it, it is never the only one. Knowing all the steps a certain

statement has gone through is as fundamental as knowing, e.g., its temporal

anchor in order to verify its veracity and to understand and interpret it. Just consider

example (1) below:

(1) According to The Times the President wants to buy the Amazon Forest and

turn the trees into toothpicks.

The intention attributed to the President seems to come from a trustworthy

source, ‘The Times’, and would hopefully prompt immediate reactions, at least

from the environmentalists. But what if this statement was part of another

attribution relation as in the paragraph that follows (2)?

(2) “According to The Times the President wants to buy the Amazon Forest and

turn the trees into toothpicks.” The comedian pronounced these words,

joking about the President’s disregard for environmental issues.
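
The nesting in example (2) can be made explicit with a small data structure. The following Python sketch is purely illustrative and is not the annotation schema developed in this thesis: the class name `Attribution`, its three fields and the helper `source_chain` are assumptions introduced here, loosely following the source, cue and content components mentioned in the thesis outline. Treating ‘wants’ as the cue of a third, innermost attribution is likewise an assumption made for the sake of the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass
class Attribution:
    """One attribution relation (hypothetical representation)."""
    source: str                         # agent the content is attributed to
    cue: str                            # lexical material signalling the relation
    content: Union[str, Attribution]    # plain text or a nested attribution


def source_chain(attribution: Attribution) -> list[str]:
    """Return all sources, from the outermost level to the innermost one."""
    chain: list[str] = []
    node: Union[str, Attribution] = attribution
    while isinstance(node, Attribution):
        chain.append(node.source)
        node = node.content
    return chain


# Example (2): the comedian reports The Times, which reports the President.
intention = Attribution(
    source="the President",
    cue="wants",
    content="to buy the Amazon Forest and turn the trees into toothpicks",
)
report = Attribution(source="The Times", cue="According to", content=intention)
outer = Attribution(source="The comedian", cue="pronounced these words", content=report)

print(source_chain(outer))
```

Walking the chain from the outside in shows that ‘The Times’ is not the outermost source of the statement, which is exactly what is lost when only one level of attribution is recorded.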

A last remark concerns the utility and importance of developing such a project in a

language other than English. First of all, findings and results from

studies on the English language cannot always be entirely valid

for other languages. Secondly, the importance and life of a language also depend

on these efforts to make it available for every possible use. Having language

resources for Italian means providing support for studies and research and allowing

the development of tools specific to this language, thus enabling its speakers to

rely on it for the full range of their needs. Lastly, developing resources in several

languages provides precious data for inter-linguistic comparison, thus making it

possible to identify aspects which are common and aspects instead peculiar to

each language.

1.1 An Independent Approach to Attribution

Being able to automatically link together attributed material and its source would

represent a big advantage for a number of tasks. At present, this is still not

possible. A manually annotated corpus for attribution is surely not the solution;

however, it represents an important step towards it. Studies aiming at developing

tools for the recognition of attribution would in fact need a complete description of

how the phenomenon functions and is expressed, together with an already

annotated corpus to test their reliability.

Although attribution relations have already been annotated in a few other

projects (Wolf and Gibson, 2005; Wiebe, 2002; Prasad et al., 2007), a systematic

and independent account of the phenomenon is still lacking. Studies aiming at

capturing the complexity of discourse relations recognise the importance of

attribution, but reserve a rather secondary role for it (Wolf and Gibson, 2005;

Prasad et al., 2007). Other approaches instead distance themselves from discourse

and assume a more independent perspective or pair attribution with subjective

language (Wiebe, 2002). None of them, however, investigates attribution completely,

as they limit the annotation to only some of the attribution levels: word, clause,

sentence or document.

In the present project, attribution relations will be investigated as the starting

point towards the construction of a discourse resource for Italian and not as an

additional feature of it. Moreover, all levels of attribution will be considered and

annotated. This way of proceeding will allow exploring the topic independently

from other discourse relations and reaching a deeper understanding and a broader

account of the phenomenon.

1.2 Methodology

In order to annotate the corpus for attribution, some preliminary work needs

to be carried out. First of all, attribution relations have to be analysed in order to

identify their characteristics and spot issues which represent an obstacle to the

annotation. Afterwards, possible solutions to these problems will be proposed and

an annotation schema outlined. This will be then applied to a section of the corpus

with the help of an annotation tool.

The tool has been selected after comparing and testing several available

software applications. The choice of the annotation tool poses constraints to the

annotation schema as its limited functionality determines what is feasible and what

is not (e.g. some tools do not allow the selection of overlapping text spans). Although

ideally the tool is determined by the annotation schema and should be developed

according to its requirements, this was not realistic at this stage.

Having to rely on an existing tool, the initial annotation schema proposed

will therefore be adapted to the selected tool. Performing the pilot annotation will

raise additional issues and lead to further changes to the annotation schema. The schema

will then reach its final stage, a proposal for the annotation of attributions, which

should be applicable to the rest of the corpus with the help of annotators, leading

to presumably good inter-annotator agreement.

1.3 Terminology

Before moving on, it is useful to briefly introduce some of the terminology employed.

Although ‘text’ “is used in linguistics to refer to any passage, spoken or written, of

whatever length, that does form a unified whole” (Halliday and Hasan, 1976:1), as

the texts within the scope of this study are solely newspaper articles, the term

will refer to written language only. The account of attribution provided in this thesis

should also hold for spoken language; however, further investigations are

necessary in order to determine to what extent this is true.

When generally discussing attribution, the term ‘writer’ will be mostly

employed to refer to both the writer and the speaker of the text.

Discourse is often characterised as a coherent text, as opposed to text

lacking a semantic unity. As incoherent texts will not be taken into consideration,

both ‘discourse’ and ‘text’ will be generally used to refer to a coherent unit of

language.

The lexical material signalling an attribution relation will be mostly identified

as ‘cue’ or ‘text anchor’.

1.4 Outline of the Thesis

In the second chapter, the framework of discourse studies will be briefly introduced

together with a survey of discourse annotation projects. Afterwards, attribution will

be defined and projects involving its annotation reviewed.

The third chapter presents the phenomenon of attribution and provides an

analysis of its constitutive elements, with particular attention to the elements

expressing them in the text. Some of the most problematic issues connected to

attribution relations and their annotation are also investigated.

A first annotation schema proposal is described in chapter four. The

description focuses on the features to include in the annotation. These attributes

and their possible values are carefully analysed and described with the help of

examples.

The fifth chapter illustrates the stages towards performing a pilot project in

order to test the feasibility of the schema on the corpus. These include the

specification of the tool requirements, the analysis and selection of the most

suitable tool among the ones currently available and the setting of the selected

tool. Afterwards, some additional issues or issues identified through the pilot

annotation are also presented.

In the sixth chapter the final annotation schema proposed for the annotation

of attribution relations is briefly summarised and guidelines concerning the

annotation are provided, as they have been adopted for the pilot, in order to

facilitate the selection of the relevant text spans and the assignment of the

attribute values.

In the last chapter conclusions are drawn and future developments

discussed.

2 Discourse and Attribution


2.1 What is Discourse?

2.1.1 Definition

Aristotle already understood it and warned us in his Metaphysics that “The whole

is more than the sum of its parts.” This also holds for such ‘wholes’ as texts,

where the meaning deriving from the juxtaposition of clauses, as pointed out by

Moore and Wiemer-Hastings (2003), may not coincide with the meaning of the

individual clauses and may imply more than that. Discourse could therefore be

defined as ‘propositions in context’ (Péry-Woodley and Scott, 2007).

Units of language are usually organised in a coherent way and researchers

agree that coherent text has a structure and that understanding the way it

functions is fundamental for the understanding of discourse (Grosz and Sidner,

1986; Hobbs, 1985). This structure needs to be taken into consideration when

dealing with natural language generation but also with tasks such as co-reference

resolution, temporal relations and attribution relations. Coherency not only

depends on the relations holding between strings but also has to do with

extra-linguistic components such as the writer/speaker, the recipient, the knowledge

they share and the communicative situation.

Another concept strongly connected to coherency and contributing to it is

that of cohesion. Cohesive elements are linguistic devices employed to signal

connections between text units. Both coherency and cohesion will be referred to in

the next section, as some approaches ground discourse relations semantically

and therefore focus on the elements giving coherency to the discourse, while others try

to account for the cohesive means by which this coherence is linguistically

expressed.

2.1.2 Theories of Discourse Coherence and Cohesion

“Between sentences, there are no structural relations, and this is where the study

of cohesion becomes important.” (Halliday and Hasan, 1976:146).


Two metaphors are usually employed by theories of coherence: that of focus,

which holds between entities referred to in a text and can involve more than two

text spans; and that of relation, binary in nature and linking instead sections of

text. Different theories have taken different approaches to discourse relations

which Knott (1996) describes as ‘deep’ and ‘surface structure’. The ‘deep structure’

theories investigate discourse relations identifying the semantic relations which

underlie ‘surface syntactic relations’ (Grimes, 1975). ‘Surface structure’

approaches, on the contrary, consider the ‘deep’ semantic relations less important

and characterise discourse relations from the outside, identifying possible

resources signalling them on the ‘surface structure’ (Halliday and Hasan, 1976).

The types of structure usually employed by computational models of

discourse processing are three. The first is the informational structure, “the relation

between the information conveyed in consecutive elements of a coherent

discourse” (Moore and Pollack, 1992:537), which deals with semantic relations,

e.g. the causal relation. The attentional structure (Grosz and Sidner, 1986)

determines instead the ‘focus’ or ‘centre’ of attention: the information or entities

which are most relevant at any given point. The third type of structure is the

intentional structure which deals with the intentions of the speaker/writer and

therefore with what they are trying to accomplish through the communicative act.

This kind of structure underlies Grosz and Sidner’s (1986) concept of

discourse relation. In their theory, relations apply to discourse segments (DS) and

combine them in larger DSs. The ‘intention’ relations are those of ‘dominance’,

when the satisfaction of the subordinate segment contributes to the satisfaction of the

dominant one, and ‘satisfaction-precedes’, in which the satisfaction of one

segment precedes the satisfaction of another segment and together they contribute to

the satisfaction of a third dominant one. Discourse segments are therefore

organised in a hierarchical structure of goals and sub-goals. Considering

discourse as a composite of linguistic structure, intentional structure and

attentional state, Grosz and Sidner also account for the interaction between

relation and focus and they present every discourse segment as also having a

focus space determined by dominance relations.

Two additional structures should also be added to the three already

mentioned. One is the information structure, which has to do with the concepts



of theme and rheme, the former being the part connected to the rest of the

discourse and the latter the new information which is introduced about it. The

other one is the rhetorical structure, which defines a set of rhetorical relations

that can connect consecutive discourse elements.

Rhetorical relations are the core of the RST (Rhetorical Structure Theory)

formulated by Mann and Thompson (1988). Rhetorical relations (RR) are

functionally defined as the effect the writer intends to achieve and are expressed

by linguistic devices. RRs entail the concept of nuclearity, that is the centrality of

the span with respect to the writer’s purposes. Nucleus and satellite relations, and

less commonly multinuclear relations, structure the text and can be exemplified by

schema applications (Figure B), which can then be mapped onto text. A

hierarchical system of schema applications produces a Rhetorical Structure tree.

For a text to be coherent, it should be possible to represent it with a single RS

tree.

Figure B - RST schemas
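The nucleus/satellite organisation described above can be sketched as a small tree structure. The following Python fragment is only an illustration: the class name `Node`, the relation labels and the three-span text are assumptions made for the example and do not reproduce any RST tool or corpus format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    role: str                      # "nucleus" or "satellite" (the root is given "nucleus")
    text: str = ""                 # set on leaves (text spans)
    relation: str = ""             # set on internal nodes (schema applications)
    children: List["Node"] = field(default_factory=list)

    def is_multinuclear(self) -> bool:
        # A schema application with more than one nucleus among its children
        return sum(c.role == "nucleus" for c in self.children) > 1


def leaves(node: Node) -> List[str]:
    """Return the text spans covered by a node, in left-to-right order."""
    if not node.children:
        return [node.text]
    spans: List[str] = []
    for child in node.children:
        spans.extend(leaves(child))
    return spans


# Hypothetical three-span text (invented for illustration)
contrast = Node("nucleus", relation="contrast", children=[
    Node("nucleus", text="Prices rose."),
    Node("nucleus", text="Wages did not."),
])
root = Node("nucleus", relation="elaboration", children=[
    contrast,
    Node("satellite", text="The gap had been widening for years."),
])
```

In this sketch the requirement of a single RS tree corresponds to `leaves(root)` returning every span of the text exactly once and in order.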

2.1.3 Constituency vs. Dependency

Webber (2006) argues that the approaches to discourse structures can also be

grouped according to the concepts of constituency and dependency. The RST

approach is based on constituency, the idea of linguistic units as “parts within

parts”, having “specific roles or functions” (Webber, 2006:340), and considers this

as the only basis for discourse relations. Their instantiated schemas represent the

constituency structure and correspond to discourse relations between consecutive

spans (i.e. clauses or projections of instantiated schemas).

Also based on constituency is Polanyi's Linguistic Discourse Model (LDM).

This is similar to the RST; however, it separates discourse structure, formed by a

hierarchy of discourse units, from discourse interpretation. Its discourse parse

tree (DPT) can be described by a context-free grammar consisting of 3 re-write


rules: an N-ary branching rule for discourse coordination, a binary branching rule

for discourse subordination and an N-ary branching rule with sisters related by a

logical or rhetorical relation and contributing to the interpretation of their parent

node. The DPT is right open, which means that every discourse unit resuming an

interrupted constituent also closes it off, thus making it impossible for any

subsequent coordinate or subordinate discourse unit to attach to it. This claim,

which does not allow for incrementation, is similar to the Intention Stack

mechanism depicted by Grosz and Sidner (1986) and the notion of Right Frontier

in Webber (1988).
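The attachment constraint can be illustrated with a short sketch. It assumes the common reading of the right frontier as the set of nodes on the path from the root to the rightmost leaf; the names `Unit`, `right_frontier` and `attach` are hypothetical and are not taken from the LDM or from Webber (1988).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    label: str
    children: List["Unit"] = field(default_factory=list)


def right_frontier(root: Unit) -> List[str]:
    """Labels on the path from the root to the rightmost leaf."""
    path = [root.label]
    node = root
    while node.children:
        node = node.children[-1]
        path.append(node.label)
    return path


def attach(root: Unit, parent_label: str, new_unit: Unit) -> bool:
    """Attach new_unit under parent_label only if that node is still open,
    i.e. only if it lies on the right frontier."""
    node = root
    while True:
        if node.label == parent_label:
            node.children.append(new_unit)
            return True
        if not node.children:
            return False
        node = node.children[-1]


# Hypothetical tree: constituent A has been closed off by the later constituent B
tree = Unit("S", [Unit("A", [Unit("a1")]), Unit("B", [Unit("b1"), Unit("b2")])])
```

Here a new unit can attach to `S`, `B` or `b2`, whereas an attempt to attach to `A` is rejected because `A` no longer lies on the right edge of the tree.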

Another approach to the structure of discourse is that of relating discourse

cohesion to dependency, which can be of three kinds: syntactic, semantic and

anaphoric. In Halliday and Hasan (1976) this is solely anaphoric dependency.

Their idea of cohesion is that of a part whose interpretation requires the

interpretation of another part to be enabled. Five types of cohesion can be

identified on this basis: anaphora, substitution, ellipsis, lexical cohesion (e.g.

repetitions, synonymy) and conjunction, the latter being the only one responsible

for discourse relations. As pointed out by Webber (2006), anaphoric relations

have, however, no constraint on their locality, no constraint on the number of text

parts a given unit can depend on and no constraint on the discourse units that can

be linked together. The lack of constraints in this approach results in embedded

and cross-relations being allowed.

Other approaches have taken a perspective that combines constituency

and dependency in order to account for discourse relations. In the mixed

approaches constituency and dependency participate in shaping the discourse

structure and determining its cohesion. Wolf and Gibson’s (2005) discourse

structure relations, a set of informational relations based on Hobbs (1985), are

associated with constituency alone. However, they do not separately account for

anaphoric dependency, which is responsible for relations between non-adjacent discourse segments.

In this way their approach can be seen as part of the mixed approaches.

In their theory of discourse graphs, Wolf and Gibson identify discourse

segments as non-overlapping spans of text constituted either by a clause or an

attribution. Segments which are related on the basis of a common topic or

attribution are grouped together. Groups can also engage in a discourse relation



with a clause or another group. This results in a sort of hierarchical structure which

can be related to constituency. Moreover, Wolf and Gibson argue that tree-

structures are not adequate for accounting for discourse coherence and propose a

chain-graph in order to represent problematic aspects such as nodes with multiple

parents and cross-relations.

Although their approach tends to associate discourse structure solely with

constituency, dependency plays an important role in determining their claims.

Cross-relations, which appear to be quite frequent and mainly associated with the

relation of ‘elaboration’, could be explained through dependency. Webber (2006)

notices that cross-relations, which represent the main argument against the tree-

structure, are often anaphoric dependencies.

Also mixed, although very different, is the Lexicalized Tree-Adjoining

Grammar for Discourse (D-LTAG) approach (Cristea and Webber, 1997, Webber

et al., 2003). Discourse relations are lexicalised in the sense that this theory

provides an account of the lexical anchors bearing them. The arguments to these

relations are also lexicalised. Each lexical entry is associated with a set of tree-

structures specifying its syntactic configuration. In this lexical variant of TAG, the

adjoining operation, which is available at the right frontier, is paired with the

operation of substitution.

Adjoining is the operation of “identifying a discourse relation between the

new material and material in the previous discourse that still is open for

elaboration” (Cristea and Webber, 1997:91). Cristea and Webber introduce

substitution in order to account for discourse features (e.g. although, on the one

hand, suppose) raising expectations about what is to come in the following

discourse.

Figure C shows above the grammatical categories (where * is the foot of an

auxiliary tree and ↓ a substitution site) and below the adjoining and substitution

operations.



Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)

Structural (i.e. conjunctions, subordinators) and empty connectives are the

anchors of elementary trees. These discourse relations between arguments

produce a compositionally interpreted structure. Discourse adverbials exploit

instead anaphoric dependency, establishing a discourse relation connecting the

interpretation of a clause to the interpretation of a previous clause or group of

clauses.

2.2 Discourse Annotation Projects

Discourse annotation projects have become popular in recent years due to a

growing interest in better understanding discourse structure in order to

automatically interpret or reproduce it. A survey of these projects is presented in

this section.

2.2.1 RST-DT

The Rhetorical Structure Theory Discourse Treebank (Carlson et al., 2003) is a

corpus of 176,000 words from the Penn TreeBank, hence consisting of articles

from the Wall Street Journal (WSJ). Realised in the framework of the RST, the

RST-DT corpus is annotated for rhetorical relations holding between two or more

adjacent and non-overlapping text-spans.

In order to construct the discourse tree, they first proceed to identify its

minimal building block, the elementary discourse unit (EDU), which is the clause.



Once the text has been segmented, adjacent EDUs are linked via rhetorical

relations thus creating a hierarchical structure. The inventory of rhetorical relations

they employ consists of 53 mononuclear relations, where one of the spans is more

salient (nucleus) and the other conveys additional information (satellite), and 25

multinuclear relations, with equally salient spans.

2.2.2 The Penn Discourse TreeBank (PDTB)

The PDTB (Prasad et al., 2004; Webber et al., 2005; Prasad and Dinesh et al.,

2008) represents a fundamental work in the area of discourse for both its unique

approach to discourse relations based on the D-LTAG theory and the echo it has

produced, inspiring a number of recent studies and providing them with a strong

knowledge base. The PDTB is a discourse resource built on top of the PTB, the

Penn Wall Street Journal corpus. It consists of a million words annotated for

discourse connectives and their arguments. The annotation was chosen to be

stand-off as this is generally clearer than in-line XML annotation and

because the arguments of different connectives could overlap, violating the syntax

of XML.

Although not tied to any particular theory of discourse, the approach taken

is grounded in the D-LTAG approach to discourse. The idea of a lexicalised

grammar for discourse results in a bottom-up approach that avoids resorting to a

pre-defined set of discourse relations, as is done in other theories (e.g. RST). The focus of

the annotation is on discourse connectives, considered as discourse predicates

taking two text spans as their arguments, and on their arguments Arg1 and Arg2. In

the example (3) below, Arg1 is in italic and Arg2 in bold while the connective is

underlined. Discourse relations hold between Abstract Objects (AO), such as

propositions, events and states. The annotation proceeded one connective at a

time: each connective was annotated throughout the whole corpus before moving

on to the next one, as this was perceived as an easier task for the

annotators.

(3) Most oil companies, when they set exploration and production budgets

for this year, forecast revenue of $15 for each barrel of crude produced.

(Prasad and Dinesh, 2008:2)
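Example (3) also shows why stand-off annotation is convenient: one argument surrounds the connective and the other argument, which is awkward to express with in-line tags. The sketch below stores only character offsets and leaves the text untouched. It is an illustration under stated assumptions: the dictionary layout and the helper names `span_of` and `render` are hypothetical and do not reproduce the PDTB file format, and the assignment of spans to Arg1 and Arg2 is an assumed reading of the example, taking Arg2 to be the clause introduced by the connective.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

text = ("Most oil companies, when they set exploration and production budgets "
        "for this year, forecast revenue of $15 for each barrel of crude produced.")


def span_of(fragment: str) -> Span:
    """Locate a fragment in the source text and return its offsets."""
    start = text.index(fragment)
    return (start, start + len(fragment))


# Assumed reading of example (3); the labels are illustrative only
annotation: Dict[str, List[Span]] = {
    "connective": [span_of("when")],
    "arg2": [span_of("they set exploration and production budgets for this year")],
    # Arg1 is discontinuous: it surrounds the connective and Arg2
    "arg1": [span_of("Most oil companies"),
             span_of("forecast revenue of $15 for each barrel of crude produced")],
}


def render(spans: List[Span]) -> str:
    """Rebuild an annotated string from the untouched source text."""
    return " ... ".join(text[s:e] for s, e in spans)
```

Because the annotation only points into the text, the arguments of different connectives can overlap freely without any nesting constraint.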



Discourse connectives belong to three grammatical classes: subordinating

conjunctions (e.g. because, when), coordinating conjunctions (e.g. and, or) and

discourse adverbials (e.g. for example, instead). They can also appear in

modified or conjoined form (e.g. only because, if and when) or in parallel form (e.g.

either…or, on the one hand…on the other hand). The senses of the connectives

are also annotated paying attention to their polysemous nature (e.g. ‘since’ can

have a temporal, causal or temporal-causal sense). Senses are hierarchically

classified according to their class, type and subtype as exemplified in Figure D.

Figure D - Sense classification of discourse connectives in the PDTB
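The three-level classification can be represented as a nested mapping. The following sketch contains only the labels named in Figure D; placing ‘condition’ and ‘cause’ under ‘contingency’, and ‘reason’ and ‘result’ under ‘cause’, is an assumption about the layout of the figure, and the names `SENSES` and `find_sense` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Partial hierarchy: class -> type -> subtypes. Only labels named in Figure D
# are filled in; empty entries stand for branches not shown there.
# The placement of types and subtypes is an assumption (see the lead-in).
SENSES: Dict[str, Dict[str, List[str]]] = {
    "temporal": {},
    "contingency": {"condition": [], "cause": ["reason", "result"]},
    "comparison": {},
    "expansion": {},
}


def find_sense(label: str) -> Optional[Tuple[str, ...]]:
    """Return the path (class[, type[, subtype]]) leading to a label."""
    for cls, types in SENSES.items():
        if label == cls:
            return (cls,)
        for typ, subtypes in types.items():
            if label == typ:
                return (cls, typ)
            if label in subtypes:
                return (cls, typ, label)
    return None
```

A sense annotated at the subtype level thus carries its type and class with it, so that annotations made at different depths remain comparable.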

Between adjacent text spans, discourse relations are also annotated when not

explicit, that is, when the relation can be inferred although a discourse connective

is lacking. In these cases a presumed connective is added to the annotation, with

the exception of lexicalised discourse relations (AltLex), arguments linked by an

entity-based coherence relation (EntRel), and also when no relation is perceived

(NoRel).

Arguments to a connective can be non-consecutive (3) and anywhere in the

text and are constituted of single or multiple clauses or sentences. A principle of

‘minimality’ applies to them, which prescribes for each argument the selection of

the minimum sufficient span. Additional text related to the arguments can also be

included in a discourse relation as ‘supplement’ (Sup1, Sup2).

The annotation in the PDTB contains additional information as it also

specifies the attribution of connectives and their arguments. This aspect of the

annotation will be considered and analysed at a later stage in this thesis.

[Figure D levels: class (temporal, contingency, comparison, expansion); type (condition, cause, …); subtype (…, reason, result)]



2.2.3 Other Projects

The Chinese Discourse Treebank

The Chinese Discourse Treebank (CDTB) project (Xue, 2005) is based on the

same principles as the PDTB. Similarly to the PDTB, and unlike the RST approach,

discourse relations do not represent a predefined inventory but are lexically

grounded and anchored by discourse connectives. Implicit and explicit Chinese

discourse connectives were investigated in order to add a discourse layer of

annotation to the Penn Chinese Treebank. Discourse connectives are also here

regarded as predicates taking two abstract objects as their arguments. In the

CDTB coordinating and subordinating conjunctions as well as discourse adverbials

are annotated.

The main challenges to the realisation of this project were disambiguating

lexical items which in Chinese could function both as discourse connectives and

non-discourse connectives, as well as determining the sense of polysemous

connectives, and defining the argument scope. Due to the long morphological

evolution of the Chinese language, another issue was posed by discourse

relations realised by more than one discourse connective. Hence, different

morphological forms had to be grouped as diverse realisations of the same

discourse relation. The task of annotating attribution is not included in the CDTB.

Discourse and the Prague Dependency Treebank (PDT)

Also inspired by the PDTB is the initial analysis conducted on the Prague

Dependency Treebank for the addition of a layer of annotation for discourse

(Mladová et al., 2008). The PDT is a 2-million-word corpus of Czech journalistic

texts from the Czech National Corpus. Three levels of annotation are already

available: morphological, superficial syntactic and deep syntactic. In the latter each

sentence is represented by a dependency tree connecting clauses but not

crossing sentence boundaries. In addition, however, some basic co-reference

relations are also marked and among those some textual co-reference relations

going beyond sentence boundaries.

Discourse relations will be added to PDT 3.0 in a fourth level of annotation,

containing various types of relations going beyond the sentence. The discourse

layer to be added to the PDT will use the PDTB as a background and define a new



hierarchy of discourse sense labels and exploit the discourse information already

carried by the deep syntactic level of annotation. Co-reference relations are

already marked for coordination, dependency and reference to the preceding

context; however, these need to be explicitly marked for discourse, and the PREC

label for relations going beyond sentence boundaries has to be further specified.

Discourse Annotation of the METU Turkish Corpus and the Hindi Discourse

Relation Bank

Still at their early stages are two recent projects aiming at developing a discourse

resource for Turkish and Hindi respectively. Both are based on the theoretical

assumptions postulated by the PDTB and focus on the analysis of discourse

connectives. From these studies also emerges a certain interlanguage validity of

the PDTB schema and the similar approach adopted makes them a valid source

for cross-linguistic comparison.

Most of the work done to date to prepare the ground for the discourse

annotation of the METU Turkish Corpus (Zeyrek and Webber, 2008) has been the

identification and classification of discourse connectives together with a

preliminary analysis of the argument scope. The attribution of discourse relations

and other aspects such as the annotation of implicit connectives are still

unexplored.

Similarly, the project for a Hindi Discourse Relation Bank also adopts a

lexically grounded approach to discourse relations and is focusing on the analysis

of different types of discourse connectives and their realisation in the Hindi

language. Implicit connectives and their semantic classification, together with the

attribution of connectives and their arguments are left for future developments of

the present research (Prasad and Husain et al., 2008).

2.3 Attribution

This thesis originates in the framework of developing an Italian Discourse

Treebank which, similarly to the discourse annotation projects under development for

Chinese, Czech, Hindi and Turkish, is theoretically inspired by the PDTB. However,

unlike these projects, it does not focus on the classification and annotation of



discourse connectives but on attributions, an aspect included in the PDTB but only

in a subordinate way.

2.3.1 Towards a Definition of Attribution

Defining what attributions are is a trivial task, so trivial that it is not at all easy.

Although the annotation scheme for attribution in this project is derived from the

one in the PDTB, the definition they provide of attribution as “a relation of

‘ownership’ between abstract objects and individual or agents” (Prasad and

Miltsakaki, 2008:40) is not suitable to fully describe the relations that will be here

considered and investigated. AOs refer to propositions, events or states and do not

include smaller units such as noun phrases or even single words. Another

definition is given in the RST annotation manual:

“Speech acts—verbs that are used to report both direct and indirect speech--

should be segmented and marked for the rhetorical relation of ATTRIBUTION […]

Cognitive predicates, including verbs that express feelings, thoughts, hopes, etc.,

should also be segmented and marked for the rhetorical relation of

ATTRIBUTION.” (Carlson and Marcu 2001:7, 9)

Rather than attribution itself, what is defined here, by describing the means by which it is

signalled, is a way of spotting attribution in the text. It is nevertheless possible to

infer from it that attribution is bound only to reporting or cognitive predicates, which leaves out

the cases where attribution is conveyed by prepositions (4) or just punctuation (5).

(4) According to the police, the crime rate has fallen this month.

(5) The Pope: “I will pray for the victims”.

Murphy (2005:131) provides a partial definition of attribution as “the transferral of

responsibility for what is being said to a third party.” This simple explanation,

meant to capture only the attribution of assertions, highlights however the

embedded nature of attribution, recognising a ‘third party’ in the relation. This is

because any attribution in a text or speech event is already part of a



communicative event having in the writer/speaker its natural primary source. The

insertion of a ‘third party’ allows the writer/speaker to change this default attribution

and transfer the responsibility or ownership of a certain part to another source.

As none of the above-mentioned definitions of attribution is sufficient on its own

to capture the phenomenon under consideration in this thesis, a new definition is

proposed here: attribution in a text is ascribing the ownership of an attitude

towards some linguistic material, i.e. the text itself, a portion of it or their semantic

content, to an entity. This ownership is expressed by explicitly inserting the agent

or experiencer holding the intellectual property of the linguistic material, which can

express an assertion or a mental state such as an opinion, a will or some

knowledge. Attributions as described above will be considered and investigated in

the current research.
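The definition can be made concrete as a simple record linking a source, an attitude and the attributed material. The sketch below is only an illustration of the definition: the field names, the `cue` field and the set of attitude values are assumptions made for the example and should not be read as the annotation schema developed in this thesis.

```python
from dataclasses import dataclass
from enum import Enum


class Attitude(Enum):
    # Attitude types named in the definition above
    ASSERTION = "assertion"
    OPINION = "opinion"
    WILL = "will"
    KNOWLEDGE = "knowledge"


@dataclass(frozen=True)
class Attribution:
    source: str        # the entity the material is ascribed to
    cue: str           # expression signalling the attribution (hypothetical field)
    content: str       # the attributed linguistic material
    attitude: Attitude


def default_attribution(text: str, writer: str) -> Attribution:
    """Unmarked text falls back on the writer as its primary source."""
    return Attribution(source=writer, cue="", content=text,
                       attitude=Attitude.ASSERTION)


# Examples (4) and (5): a preposition and a punctuation mark acting as cues
ex4 = Attribution("the police", "According to",
                  "the crime rate has fallen this month", Attitude.ASSERTION)
ex5 = Attribution("The Pope", ":", "I will pray for the victims",
                  Attitude.ASSERTION)
```

The `default_attribution` helper mirrors the observation that the writer or speaker is the natural primary source unless a third party is explicitly inserted.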

2.3.2 Are Attribution Relations a Discourse Phenomenon?

In order to decide if attribution relations are a kind of discourse relation it is

necessary to specify what discourse relations are. The label of discourse holds for

those texts having a structure. This structure originates from cohesive elements.

“Where the interpretation of any item in the discourse requires making reference to

some other item in the discourse, there is cohesion” (Halliday and Hasan,

1976:11). If generating cohesion were the necessary and sufficient

condition for identifying a discourse relation, attribution would surely belong to this

class.

The interpretation of an attributed element is highly dependent on its

source. Bergler (1991) distinguishes between ‘primary’ and ‘circumstantial

information’, the first being the ‘pure’ information and the latter the ‘primary

information’ within a perspective, a belief or a modality, and argues that the

interest of tasks such as knowledge extraction is ‘primary information’. She

however acknowledges the importance of the additional information carried by the

‘circumstantial information’ and stresses the intimacy of this relation. Although

‘primary information’ still is, after 18 years, the focus of knowledge extraction

tasks, the recent flourishing of studies aiming at capturing this intimate relation

shows a general understanding of the fundamental contribution of the

‘circumstantial information’ to the interpretation of the ‘primary’ one.



Attribution relations are therefore without doubt cohesive relations. Cohesion,

however, is not enough to specifically describe discourse relations as this could

represent a characteristic of relations in general. Thus a syntactic relation would

also be classified as a discourse one and this should not be the case. The second

necessary condition identifying a discourse relation is that it should hold between

discourse segments. These should be non-overlapping spans of text; however,

the literature still lacks a single agreed definition. Different discourse approaches also

adopt different discourse units. These can be intentional units (Grosz and Sidner,

1986), sentences (Hobbs, 1985), clauses or phrasal units (Mann and Thompson,

1988; Webber et al., 1999; Wolf and Gibson, 2005).

Relations of attribution can hold between sentences or inside them between

clauses or groups of clauses; attribution could therefore be considered a discourse

phenomenon.

(6) "There's no question that some of those workers and managers

contracted asbestos-related diseases," said Darrell Phillips, vice

president of human resources for Hollingsworth & Vose. "But you have to

recognize that these events took place 35 years ago. It has no bearing

on our work force today." (PDTB 0003)

Skadhauge and Hardt (2005) argue in this respect that attribution is an intra-

sentential relation, referring to the RST Treebank where it is actually treated as

such, and develop a system that they claim can automatically identify it. The

assumption is that, being an intra-sentential relation, attribution is encoded at the

syntactic level. Attribution is also a syntactic phenomenon but surely not only that.

The premises leading Skadhauge and Hardt to this conclusion are grounded in the

RST Treebank approach to attribution which considers only intra-sentential

instances of it and only under particular conditions (i.e. a verb immediately followed or

preceded by a sentential complement position, and the phrase ‘according to’). The

conclusion, quite obvious, should be that the subset of attribution relations which are

syntactically grounded, those selected by the RST-DT, can be syntactically derived

and automatically identified.

Although a certain number of attributions are expressed at the intra-



sentential level, verbs are not the only cues signalling them (see examples (4) and

(5) above). They are certainly the most common ones, however, the attributed

span is often separated from the verb by intervening material, such as adverbs,

complements or even clauses. Only by eluding the complexity of attribution relations

and considering just a subset of them could Skadhauge and Hardt easily provide a

solution for the automatic identification of this problematic phenomenon. This

very partial solution demonstrates the importance of reaching a better theoretical

description of attribution and a full account of its characteristics.
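The limits of a purely cue-based identification can be shown with a toy detector. The sketch below is not Skadhauge and Hardt's system: it is a deliberately naive illustration that looks only for a few reporting verbs followed by ‘that’ and for the phrase ‘according to’. The verb list and the function name `find_cue` are assumptions made for the example.

```python
import re
from typing import Optional

# Toy cue patterns, assumed for illustration: a handful of reporting verbs
# followed by "that", and the phrase "according to".
VERB_THAT = re.compile(r"\b(said|says|claimed|thinks|believes)\s+that\b",
                       re.IGNORECASE)
ACCORDING_TO = re.compile(r"\baccording to\b", re.IGNORECASE)


def find_cue(sentence: str) -> Optional[str]:
    """Return the matched cue string, or None if no pattern applies."""
    for pattern in (VERB_THAT, ACCORDING_TO):
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return None


verb_case = find_cue("John said that the weather would be nice tomorrow.")
prep_case = find_cue("According to the police, the crime rate has fallen this month.")
punct_case = find_cue('The Pope: "I will pray for the victims".')
```

The first two sentences are caught, while the attribution in the third, signalled by punctuation alone as in example (5), goes undetected.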

From the present work attribution emerged as being also a discourse

phenomenon. This is because it often operates at a higher level than the sentence,

connecting larger units such as sentences (6), but also clauses in separate

sentences. Moreover, it very frequently bears co-reference relations, or rather it is

bound to them ( (7), (9)).

(7) LONDRA - Con i soldi della lotteria nazionale sarà creata un’“Accademia

Britannica per lo Sport”. Lo ha deciso il primo ministro, John Major, …

(ISST re050)

LONDON – With the money from the National lottery it will be instituted a

“British Sport Academy”. It was decided by the Prime Minister, John

Major,…

Through the analysis of attribution it also became clear that it can be a

syntactically encoded phenomenon, intra-sentential and even intra-clausal, with as

little as a single word functioning as the attributed material ( (8), (9)).

(8) “Sì”, le risponde convinta un’amichetta. (ISST cs060)

“Yes”, answers to her confident a friend.

(9) “…L’umanità deve proclamare uno storico sciopero ad oltranza fino alla

distruzione di tutti gli armamenti nucleari.” Le parole registrate di

Gheddafi, …(ISST cs039)

“…The world should proclaim a non-stop strike till the destruction of all

nuclear armaments.” Gheddafi’s recorded words,…



On the other hand attribution relations can involve much larger units than

sentences or clauses and extend to the whole text or speech, reaching the

shallowest level of attribution, the one already easily captured by search

engines, in which the source is the writer of the text or the person holding a

speech, or even the newspaper or website including the article. At this level the

attribution is often conveyed by prosodic or extra-linguistic means, e.g. the

inclusion in the web-page/ newspaper/ book, a graphic pointer (Figure E), the

sound provenance.

Figure E - Graphic extra-linguistic attribution

(http://www.metrokitty.com/comics/webcomics/medterms/comic_mterms.png)

For the purpose of the present study attribution will be considered at every level at which it

can be found; however, the main account of it will be as a discourse phenomenon.

It will be considered at the discourse level itself, when sentences, propositions or

clauses or groups of them are attributed and at the sentence or even clause level,

with single words or noun phrases being attributed. However, the analysis will be

in this case limited to those instances coreferential to a discourse unit ( (7), (9)),

hence these could be also considered, in combination with the coreferential

relations, a discourse relation. The shallow level of attribution will also be included

in the annotation as the text, which is always a newspaper article, will have the

writer as its primary source, even when the writer is not directly mentioned in the

article. The attribution of the entire article to its writer will be assumed as default

and left implicit with some exceptions.



2.4 Related Studies

Attribution relations have already been included in some studies. These either

have their focus on some other discourse aspect and account for attribution only

marginally or limit their analysis to some level of attribution, e.g. the macro-level or

the intra-sentential or word level, thus neglecting attribution at the discourse level.

Nonetheless they represent a knowledge base and a starting point for the present

study. The annotation schema is in fact derived from the annotation schemas

proposed by these projects. In this section the most influential ones will be

reviewed.

2.4.1 GraphBank

The relation of attribution is included in the GraphBank (Wolf and Gibson, 2005) as

an asymmetrical or directed relation, together with cause–effect, condition, violated

expectation, elaboration, example and generalization. In contrast to symmetrical or

undirected relations, i.e. similarity, contrast and same, directed relations hold from

satellite to nucleus nodes and are related to Mann and Thompson’s (1988)

mononuclear and multi-nuclear relations. Attribution relations go from the DS

containing the source to the DS which is the content of the attribution. Attributions

in the GraphBank are separated only when the attributed material is a sentence or

group of sentences or a complementizer phrase (10). These DSs are grouped if

they are attributed to the same source. In the other cases they are treated as

single discourse segments (11).

(10) 1. John said that

2. the weather would be nice tomorrow.

(Wolf and Gibson, 2005:254)

(11) 1. The restaurant operator cited transaction costs from its 1988

recapitalization.

(Wolf and Gibson, 2005:251)
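The treatment of examples (10) and (11) can be sketched as a small graph in which segments are nodes and the attribution is a directed edge from the segment containing the source to the segment containing the content. The class name `DiscourseGraph` and the edge label "attr" are hypothetical choices for this illustration and do not reproduce the GraphBank file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiscourseGraph:
    segments: List[str]
    # (from_segment, to_segment, label); indices refer to `segments`
    edges: List[Tuple[int, int, str]]

    def attributed_content(self, source_idx: int) -> List[str]:
        """Segments reached by an attribution edge leaving the source segment."""
        return [self.segments[dst] for src, dst, label in self.edges
                if src == source_idx and label == "attr"]


# Example (10): two segments, attribution edge from source to content
ex10 = DiscourseGraph(
    segments=["John said that", "the weather would be nice tomorrow."],
    edges=[(0, 1, "attr")],
)

# Example (11): no complementizer phrase, hence one segment and no edge
ex11 = DiscourseGraph(
    segments=["The restaurant operator cited transaction costs "
              "from its 1988 recapitalization."],
    edges=[],
)
```

Since edges are stored as a plain list, nothing prevents a node from having several parents or edges from crossing, which is the property Wolf and Gibson require of their chain-graphs.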

Wolf and Gibson added attribution to the relations in Hobbs (1985) as they are

dealing with text taken from news corpora. However, they consider attributions,



more than coherence relations themselves, just as “carriers of coherence

structures” (Wolf and Gibson, 2005:251).

2.4.2 Opinion Corpus

Connected to attribution are also works in the fields of opinion and emotion

annotation and recognition. The most consistent and closely related study in this

respect is the Opinion Corpus (Wiebe, 2002; Wiebe et al., 2005; Wilson and

Wiebe, 2005). It consists of more than 11,000 sentences from the world press,

annotated for ‘private states’. This term covers: opinions, beliefs, thoughts,

feelings, emotions, goals, evaluations and judgements. A private state consists of

“an experiencer holding an attitude, optionally toward an object” (Wiebe, 2002:4).

Private states partly overlap with the types of attribution considered for the present

study. Although feelings and emotions are not part of the annotation, and therefore the

‘object’ of the private state is not optional, other categories such as beliefs and

thoughts are included, together with assertions.

For the annotation of private states Wiebe et al. (2005) create three frames

each corresponding to a type of private state expression: explicit mentions of

private states, speech events expressing private states and expressive subjective

elements. Key elements of these frames are: the ‘text anchor’, namely the text

span representing the speech act or the private state; the ‘source’, employed to

refer to both the experiencer of a private state and the writer or speaker of a

speech event; the ‘target’, although this is only included in the first two frames;

and a set of properties. Properties include the ‘intensity’ of the private state, the

‘expression intensity’, which denotes the contribution of the text anchor to the

intensity of the private state, ‘insubstantial’, when a private state is e.g. in the

scope of a conditional and is therefore not presented as real in the discourse, and

‘attitude type’, accounting for the polarity of the private state.

Assertions are annotated through the ‘objective speech event frame’ if the

target is presented as an objective fact. Another important aspect of their

annotation is the inclusion of an agent frame in order to identify with a unique ID

every source in the text. This feature is particularly significant in order to deal with

bridging or pronominal anaphora, that is, cases in which the same source is mentioned several

times by means of different nouns or pronouns, which makes the identification

of a unique source quite challenging.

Sentences presenting private states and speech events are analysed in

three parts. ‘On’ designates the text anchor corresponding to the private

state or speech event itself. ‘Outside’ includes instead the source and everything

else in the sentence outside the scope of the private state or speech event, which

is labelled as the ‘inside’.

(12) outside: “On Tuesday, John …while hanging up the phone.”

on: “said that”

inside: “he was leaving”

(Wiebe, 2002:8)
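
The three-part analysis can be pictured as a partition of the sentence by character offsets. The following Python sketch is purely illustrative: the function name `split_on_inside_outside`, the offset-based representation and the full sentence reconstructed from example (12) are assumptions made here and are not taken from Wiebe's (2002) manual.

```python
def split_on_inside_outside(sentence, on_span, inside_span):
    """Partition a sentence into 'on', 'inside' and 'outside' parts.

    on_span and inside_span are (start, end) character offsets, end exclusive
    (a hypothetical representation chosen for this sketch).
    """
    on = sentence[on_span[0]:on_span[1]]
    inside = sentence[inside_span[0]:inside_span[1]]
    # Everything not covered by 'on' or 'inside' counts as 'outside'.
    outside = []
    cursor = 0
    for start, end in sorted([on_span, inside_span]):
        if cursor < start:
            outside.append(sentence[cursor:start].strip())
        cursor = max(cursor, end)
    if cursor < len(sentence):
        outside.append(sentence[cursor:].strip())
    return {"on": on, "inside": inside, "outside": [p for p in outside if p]}


# Sentence reconstructed from example (12); the reconstruction is an assumption.
sentence = "On Tuesday, John said that he was leaving while hanging up the phone."
on_start = sentence.index("said that")
inside_start = sentence.index("he was leaving")
parts = split_on_inside_outside(
    sentence,
    (on_start, on_start + len("said that")),
    (inside_start, inside_start + len("he was leaving")),
)
print(parts)
```

In this representation the ‘outside’ is discontinuous, which mirrors the ellipsis in example (12).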

The Opinion Corpus surely represents a model and a knowledge base for the

present study regarding the annotation of attributions. This model needs however

to be expanded to go beyond the sentence boundaries, in order to avoid

approaching attribution once again merely as a syntactical intra-sentential

phenomenon.

2.4.3 PDTB - The Penn Discourse TreeBank

Apart from annotating lexically grounded discourse relations in the form of

discourse connectives and their arguments, the PDTB goes further by also including

attribution relations in the annotation. Considered as a “relation of ‘ownership’

between abstract objects and individuals or agents” (Prasad and Miltsakaki et al.,

2008:40), attribution often overlaps with discourse connectives and their

arguments. Discourse connectives also establish relations between AOs

and can therefore hold between attributions (13) or just between the AOs

representing the content of attributions (14). The discourse relation itself can be

the AO representing the content of an attribution relation (15). In the examples that

follow, taken from the PDTB 2.0, the text spans corresponding to Arg1 are shown

in italics, those for Arg2 are in bold, the discourse connectives are underlined and

the attribution phrases are identified by small capitals.



(13) ADVOCATES SAID the 90-cent-an-hour rise, to $4.25 an hour by April 1991, is

too small for the working poor, while OPPONENTS ARGUED that the increase

will still hurt small business and cost many thousands of jobs. (PDTB

0098)

(14) Factory orders and construction outlays were largely flat in December while

PURCHASING AGENTS SAID manufacturing shrank further in October.

(PDTB 0178)

(15) “The public is buying the market when in reality there is plenty of grain to

be shipped,” SAID BILL BIEDERMANN, ALLENDALE INC. DIRECTOR. (PDTB 0192)

Discourse connective and attribution relations appear as separate layers that can

occur independently or coexist overlapping or even being included one in another.

The approach taken by the PDTB, however, considers attribution as subordinate to

the identification and annotation of discourse connectives and as the focus is on

the latter, attribution appears more as an additional feature to be added to

connectives and their arguments than as an independent phenomenon. Attribution

is in fact annotated in the PDTB only when, and whenever, a discourse relation exists,

thus leaving out those instances of attribution that occur independently of one.

Moreover, what is actually marked is the attribution of the discourse connective

and of its two arguments Arg1 and Arg2. Therefore, a nested attribution included

e.g. in one of the arguments cannot be accounted for and is also left unmarked. In

the example below (16), the discourse relation in quotes is attributed to ‘Gov.

Nelson Rockefeller of New York’ and there is no account of the nested attribution

of an intention expressed by ‘want’ and concerning the span: ‘to keep the crime

rates high’.

(16) In 1966, en route to a re-election rout of Democrat Frank O’Connor, GOP

GOV. NELSON ROCKEFELLER OF NEW YORK appeared in person SAYING, “If you

want to keep the crime rates high, O’Connor is your man.” (PDTB 0041)

Key properties of attribution included in the PDTB annotation scheme are: source,



type, scopal polarity and determinacy. The source feature specifies if the source of

the attribution, i.e. the agent in the relation of ownership, is the writer (Wr), another

specific agent (Ot), or an arbitrary source (Arb). The writer is always marked as the

source when no explicit attribution is made (17). While ‘Other’ refers to a

determinate source either explicitly mentioned (15) or inferable from some other

occurrences in the text, ‘Arbitrary’ sources lack a referential agent. This

happens for example in the case of an impersonal source or an attribution with an

agentless passive verb or an adverb (17) as the reporting phrase. In the following

example, the relation and Arg1 are attributed to the writer, while Arg2 is labelled as

arbitrary.

(17) East Germans rallied as officials REPORTEDLY sought Honecker’s ouster.

(PDTB 2278)

Another feature of attribution in the PDTB is the type. This partly accounts for the

degree of factuality of the AOs. Type can take four values: assertions, beliefs, facts

and eventualities. Assertion propositions (Comm) are generally conveyed by verbs

of communication (18), e.g. ‘say’, ‘explain’, ‘announce’. Implicit attributions to the

writer (19) also take this value. Belief propositions, which partly correspond to the

‘private states’ of opinions, beliefs and thoughts, are instead expressed by

propositional attitude verbs (20), i.e. verbs entailing a mental process such as

‘think’, ‘believe’, ‘doubt’, and are labelled as PAtt.

(18) “We won’t put any burden on Farmers,” HE SAID. (PDTB 2403)

(19) Besides, to a large extent, Mr. Jones may already be getting what he wants

out of the team, even though it keeps losing. (PDTB 1411)

(20) Scientists need to understand that while THEY TEND TO BELIEVE their work is

primarily about establishing new knowledge or doing good, today it is also

about power. (PDTB 1495)

Facts are associated with factive and semi-factive verbs and involve the attribution

of an AO presented as factual. To this type belong verbs of perception and knowledge such as

‘hear’, ‘know’, ‘remember’. The last type of attribution verbs has to do instead with

agents holding an intention or attitude towards the AO. Prasad and Miltsakaki et al.

(2008) present eventualities (Ctrl) as conveyed by control verbs. These are: verbs

of influence (21), such as ‘order’, ‘allow’ and ‘persuade’; verbs of commitment,

such as ‘agree’, ‘promise’ and ‘accept’; and verbs of orientation such as ‘hope’,

‘want’ and ‘wish’.

(21) Eward and Whittington had planned to leave the bank earlier, but MR.

CRAVEN HAD PERSUADED THEM to remain until the bank was in a healthy

position. (PDTB 1949)

Another feature of attribution in the PDTB is scopal polarity. This

feature identifies cases in which a negation that on the surface

appears to scope over the attribution verb instead reverses the polarity of the

attributed AO (22). It is important to recognise the real scope of the negation as

this affects the last feature present in the annotation of attribution in the PDTB:

determinacy. Determinacy has to do with the truth value of the attribution. In case

of an attribution verb being in the scope of a negation (23), or e.g. in a conditional

or infinitive context, the attribution itself is not presented as real and it should be

handled as such when drawing considerations about the AO on the basis of this

relation. This does not mean that the attribution is therefore certainly unreal as it

could also be that the attribution is just shown as possible (24) or probable.

(22) I DON’T THINK it’s a main consideration. (PDTB 0090)

=

I THINK it’s not a main consideration.

(23) Yet the Soviet leader's readiness to embark on foreign visits and steady

accumulation of personal power, …, DO NOT SUGGEST that Mr. Gorbachev is

on the verge of being toppled; (PDTB 0439)



(24) SOME MAY BE TEMPTED TO ARGUE that the idea of a strategic review merely

resurrects the infamous Zero-Based Budgeting (ZBB) concept of the Carter

administration. (PDTB 0692)
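
The four features described above can be summarised as a small record type. The Python sketch below is only an illustration under stated assumptions: the class and field names are hypothetical, the two boolean flags are a simplification of scopal polarity and determinacy, the label `Ftv` for facts is not given in the text above and is assumed here, and the source value assigned to example (22) is this sketch's own reading rather than a value quoted from the PDTB.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    WRITER = "Wr"
    OTHER = "Ot"
    ARBITRARY = "Arb"


class AttributionType(Enum):
    ASSERTION = "Comm"
    BELIEF = "PAtt"
    FACT = "Ftv"          # label assumed; not stated in the text above
    EVENTUALITY = "Ctrl"


@dataclass(frozen=True)
class PdtbAttribution:
    """Hypothetical container for the PDTB attribution features."""
    source: Source
    attr_type: AttributionType
    scopal_polarity_negated: bool = False  # surface negation applies to the AO
    indeterminate: bool = False            # attribution not presented as real


# Example (18): "We won't put any burden on Farmers," HE SAID.
ex18 = PdtbAttribution(Source.OTHER, AttributionType.ASSERTION)

# Example (22): I DON'T THINK it's a main consideration.
# The source value is an assumption of this sketch.
ex22 = PdtbAttribution(Source.OTHER, AttributionType.BELIEF,
                       scopal_polarity_negated=True)

print(ex18)
print(ex22)
```

Keeping the two flags separate reflects the distinction drawn above: scopal polarity concerns the attributed AO, whereas determinacy concerns the attribution itself.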

A last issue addressed by the PDTB concerns the annotation of the attribution text

span. The attribution span corresponds to the material containing the information

about the source, the type, the scopal polarity and the determinacy of the

attribution. The AO is usually annotated separately. The attribution spans “are

often left unexpressed in the sentence in which the AO is realized, and have to be

inferred from the prior discourse” (Prasad and Miltsakaki et al., 2008:48). When

the attribution is to the writer and implicit, no text span is selected.

The text span also includes, for every element that is part of it, its non-clausal

modifiers, e.g. adverbs and appositive noun phrases. In some cases the attribution

is signalled not by a verb but by a prepositional group such as

‘in the eyes of’ and ‘according to’ (25), or by an adverb like ‘reportedly’ and ‘allegedly’,

which then represents the text anchor of the attribution. When one of these constructions

signals the attribution relation, the attribution span is a non-clausal

phrase. Non-clausal attributions are included in the argument span corresponding

to their AO as the PDTB annotation conventions do not allow keeping phrasal

modifiers separate from the span they modify (25).

(25) No foreign companies bid on the Hiroshima project, ACCORDING TO THE

BUREAU. But the Japanese practice of deep discounting often is cited

by Americans as a classic barrier to entry in Japan’s market. (PDTB

0501)

2.5 Summary

This chapter has presented a review of different approaches to discourse structure

and coherence relations, introducing different theories and surveying the main

projects regarding the construction of discourse annotated resources.

Attribution relations have been shown to be not only a syntactic, intra-sentential

phenomenon, as some studies have regarded them, but also to

scope over discourse units and even to relate extra-textual material. A new

definition of attribution has also been proposed, in order to meet the need for

one adequate to the scope of the present study.

Some annotation projects involving attribution were also reviewed. The

annotation schema developed in this thesis will be grounded on these projects,

though with some modifications. In order to provide a complete account of

attribution, it is necessary to extend and adapt these annotation schemas to the

range of linguistic units between which this relation can hold (i.e. word, clause,

sentence, discourse segment, discourse). Moreover, as the complexity of such a

wide scope suggests, an approach to attribution independent from other syntactic

or discourse phenomena will be adopted.

The benefit of this approach will be reaching a better description of the

phenomenon and the development of a complete resource to be employed for

attribution related studies.

3 An Analysis of Attribution



Before proceeding with the definition of an annotation schema for attribution, a

deeper understanding and description of the phenomenon is required. Attribution

will be segmented in its constitutive elements, which represent the fundamental

units of the annotation. This will also enable a more considered selection of the

features to be included in the schema. Moreover, the analysis of the different

components playing a role in the attribution relation will provide an account of the

different lexical elements possibly representing them.

Finally, some characteristics of attribution and various issues representing a

challenge for the annotation will be discussed and possible solutions proposed.

3.1 The Components of Attribution

Attribution relations are intuitively composed of at least two elements: the

attributed linguistic material and the entity this is attributed to. The latter is usually

referred to as the source (Prasad et al., 2007; Wiebe, 2002), which includes the

experiencer of an emotional state as well as the writer or speaker of a text. The

former, due to the multiplicity of its possible referents, has no unique label.

In the literature the attributed element has been termed as the ‘text’ or

‘document’, when dealing with document-level attribution, as the ‘AO’ (Prasad et

al., 2007), representing a discourse unit, or interchangeably, when annotating

opinions, as the ‘object’, ‘content’, ‘inside’ (Wiebe, 2002) or ‘target’ (Wiebe et al.,

2005) towards which a certain attitude is held by the source. As AO refers to a

discourse segment, which is not always the case in this study, this term will not be

used. The terms proposed for the annotation of opinions are all equally valid,

however, in order to avoid confusion, the attributed linguistic material will be

univocally identified here as the content.

In addition to source and content a third element is fundamental in the

relation: the lexical anchor signalling the existence of an attribution. This has been

assimilated to the source in the PDTB and jointly annotated as the ‘attribution

phrase’. In the manual for the sentential annotation of opinions (Wiebe, 2002:6)

“the private-state or speech event phrase itself” is identified as ‘on’. In the



annotation scheme proposed later (Wiebe et al., 2005), this element is included in

both the private state and speech event frames as the ‘text anchor’. Although ‘text anchor’

and ‘lexical anchor’ will occasionally be employed, the element connecting source

and content will be labelled in this work as the cue.

In the examples that follow, when the cue, the source or the content are

highlighted, this will be done as follows: the source span in bold, the cue

underlined and the text corresponding to the content in italics.
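
The three components can be thought of as spans over the same text. The Python sketch below is a minimal illustration and not the annotation schema itself: the class name `AttributionRelation`, the helper `find_span` and the character-offset representation are hypothetical, and the exact span boundaries (e.g. leaving ‘that’ outside both cue and content) are an assumption. The sentence is the example given in the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass
class AttributionRelation:
    """Hypothetical record linking source, cue and content spans."""
    text: str
    cue: Span
    content: Span
    source: Optional[Span] = None  # None when no source span is annotated

    def span_text(self, span):
        # Return the surface string of a span, or None for a missing span.
        return self.text[span[0]:span[1]] if span is not None else None


def find_span(text, fragment):
    """Locate the first occurrence of a fragment as a (start, end) span."""
    start = text.index(fragment)
    return (start, start + len(fragment))


text = "The minister says that taxes will rise in 2010."
relation = AttributionRelation(
    text=text,
    source=find_span(text, "The minister"),
    cue=find_span(text, "says"),
    content=find_span(text, "taxes will rise in 2010"),
)
print(relation.span_text(relation.source),
      relation.span_text(relation.cue),
      relation.span_text(relation.content), sep=" | ")
```

Making the source optional is a design choice of the sketch, intended to leave room for relations in which no source span can be selected.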

3.1.1 The Source

The source of an attribution relation is the entity the content is ascribed to.

Sources are usually the agents (26) of a speech event, when it is a statement to

be attributed, or the experiencers (27), if dealing with a ‘private state’.

(26) Chairman Krebs says the California pension fund is getting a bargain price

that wouldn’t have been offered to others. (PDTB 0331)

(27) Sue thinks that the election was fair. (Wiebe et al., 2005:9)

However, the picture is often more complicated than this. Quite frequently, the

sources mentioned are not animate agents. Contents are often attributed to

institutions or knowledge repositories, such as law codes, studies, reports and

newspapers. Although these are usually in a metonymical relation to the actual

animate source, this is deliberately left out of the attribution as unknown, irrelevant

or even a plurality (28). In the example below (29) the content is a piece of

information which needs a reliable source to be considered trustworthy. Ascribing

the content to a major newspaper is here more effective than directly citing an

unknown journalist.

(28) La Costituzione prevede la mozione di fiducia per battezzare un governo,

quella di sfiducia per farlo cadere. (ISST els035)

The Constitution provides for a motion of confidence to install a

government and a motion of no confidence to bring it down.



(29) Il quotidiano Ma’ariv riporta che è stato rafforzato il servizio di

sorveglianza attorno a Rabin, al capo di stato maggiore Shahak, al ministro

degli Esteri Peres, a quello della Polizia Shahal e dell’Ambiente Sarid.

(ISST cs042)

The newspaper Ma’ariv reports that the security detail around Rabin,

the Chief of Staff Shahak, the Foreign Minister Peres,

the Police Minister Shahal and the Environment Minister Sarid has been strengthened.

In other cases the source is not an agent but a specification or an adjective of its

metonymic referent, e.g. the words for the speaker, the document for the writer, in

agentive position ((30), (31)).

(30) According to John’s declaration, Mary left the party before midnight.

(31) The presidential report announced that the Defence Minister resigned

today.

When a source is adding credibility to the content it is related to, it is usually

explicitly mentioned through the attribution relation. However, especially in

journalistic texts, attribution relations serve another purpose: they remove liability

from the writer, interposing another source. Sometimes this strategy is used when

the provenance of the information in the content is not certain or not known. In this

case the metonymic source is lacking a specific referent on purpose (32). In this

way the writer is not assuming the responsibility of the given statement, without

really attributing it to another specific source.

(32) …secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che

non intendeva fare nulla di male e che per lui si è trattato di un “gioco”.

(ISST cs004)

…according to leaks, he allegedly told the investigators

that he did not intend to do anything wrong and that for him it was just a

“game”.



(33) Secondo anticipazioni l’esame del Consiglio di Stato avrebbe avuto un

esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai primi

di giugno. (ISST sole153)

According to advance reports, the Council of State’s examination reportedly had a

positive outcome and the regulation should get the green light in

early June.

Sources without a corresponding referent can also be indefinite entities, e.g. ‘the

people’, ‘someone’, or impersonal pronouns, e.g. ‘one’, ‘you’. Moreover, an

attribution relation can exist although paradoxically one of its constitutive elements,

the source, is missing. This effect is achieved through the use of e.g. a passive

attribution verb lacking the agent (34), a past participle (35) or an infinitive.

(34) È stato detto che si tratta di sport, non bisogna farne una tragedia; (ISST

els060)

It has been said that we’re dealing with sport, we shouldn’t make a fuss out

of it;

(35) L’accordo annunciato ieri… (ISST sole101)

The agreement announced yesterday…

As Italian is a pro-drop language, quite often the source is left implicit. This,

however, does not mean that it is missing. It corresponds in fact to the implicit

personal pronoun of the attribution verb, usually coreferential to the explicit entity

mentioned somewhere else in the text.

(36) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita

allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E

Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare

molto il comportamento dei commentatori sulle emittenti di Berlusconi.

(ISST cs043)

Probably Vialli has not forgotten the rumours about his presumed ‘happy

life’ during the 1990 World Cup revealed on Italia1 by Maurizio Mosca. And



(he) doesn’t believe that the recent alliance Juventus-Milan could really

change the commentators’ behaviour on Berlusconi’s televisions.

3.1.2 The Content

The content of an attribution could be regarded as the nucleus (Wolf and Gibson,

2005) of the relation. The source and the cue act as satellite elements and

therefore, according to RST (Mann and Thompson, 1988), convey

additional information. As already mentioned, the content can be

constituted by different linguistic units.

Word or phrase

A single word or phrase can already constitute the content of the attribution, as in

((37), (38)). This is the case not only when it represents a short but

complete utterance directly reported. ‘Yes/no’ functions in this case as a sentence

substitute and therefore contributes to textual cohesion (Renzi, 1995). Very often,

what is attributed is not directly the content, but its ‘container’ (39).

As the main reason behind the creation of an annotated resource for

attribution is to be able to link the content with its source in order to allow a more

correct semantic interpretation of it and to account for its provenance, the

attribution of a ‘container’ of information appears at first not to be relevant. Therefore,

these words or clauses would not require annotation. In example (39), knowing

that the ‘press release’ has been issued by ‘Palazzo Chigi’ is not necessary, as it

neither represents linguistic material directly asserted by the source nor

conveys any piece of information that could be ascribed to the source.

(37) The minister addressed the president calling him “padrino”.

(38) “Sì”, le risponde convinta un’amichetta. (ISST cs060)

“Yes”, a little friend answers her with conviction.

(39) Palazzo Chigi emette un nuovo comunicato. (ISST els048)

Palazzo Chigi (seat of the Italian Government) issues a new press release.



However, this is different in the case of event anaphora, when the content is also

expressed, although somewhere else in the text (40). The annotation of the

attribution relation binding the source to the ‘container’ of the attributed span would

allow the actual content, once this metonymic co-reference relation is resolved, to

inherit the attribution relation. Similarly, the content can often be found expressed

by just a pronoun (41) co-referentially recalling the attributed utterance. In the

examples below, the content represents an instance of event anaphora, a relation

often intertwined with attribution.

(40) Palazzo Chigi emette UN NUOVO COMUNICATO. <<Sarà il governo>> scrive

<<a prendere una decisione in piena autonomia e responsabilità>>. (ISST

els048)

Palazzo Chigi issues A NEW PRESS RELEASE. <<It will be the Government>>

(it) writes <<to assume a fully autonomous and responsible decision>>.

(41) “…Dobbiamo fare un ulteriore salto di qualità, entrare in una nuova

mentalità”. A dirLO è Giuseppe Signori, … (ISST re126)

“…We have to achieve an additional quality leap, enter a new mentality”. It

is Giuseppe Signori to say IT, …

Finally, it is also possible to find a verb as the content, and at the same time cue,

of the attribution. This happens with verbs such as ‘confermare’ (to confirm),

‘accettare’ (to accept), ‘negare’/ ‘rifiutare’/ ‘smentire’ (to deny), which implicitly

involve, because of the semantics of the verb, the production of a ‘yes/ no’

utterance. In this case, however, it is not necessary to link source and content as

the verb is already syntactically connected to its subject, or object in case of a

passive verb.

Clauses

More often, what is attributed is a larger linguistic unit. This can still happen

intra-sententially, when the content is a single clause, or more than one (42). Reported

speech, direct (43) or indirect, is also usually represented at the sentence level.

Source and verbal cue together often constitute the main clause while the content



is the direct object (42) of the attribution verb. The attributed span can be

expressed by a subordinate or embedded clause. The content might also

represent the main clause, and the attributing span an incidental clause.

(42) Mr. Marcus believes spot steel prices will continue to fall through early 1990

and then reverse themselves. (PDTB 0336)

(43) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST

cs030)

“We’ll give you the statistics at the end”, promise the Croatian generals.

Sentences and larger units

Nevertheless, it is also common to find one or more clauses in a separate

sentence, or one or more full sentences (44), as the content of an attribution

relation. Discontinuous contents spreading over several sentences are often

associated with interviews (45) or testimonies, where the source and the cue are not

changing and do not need to be constantly repeated.

(44) "There's no question that some of those workers and managers contracted

asbestos-related diseases," said Darrell Phillips, vice president of human

resources for Hollingsworth & Vose. "But you have to recognize that these

events took place 35 years ago. It has no bearing on our work force today."

(PDTB 0003)

(45) Dunque, Ghezzi, che cosa significa "non cinema"? “… Per intenderci,

Moretti potrebbe girare tutta la vita ma non arriverebbe mai alla sensuosità

o fatalità cinematografica di un Michael Cimino...”. Sensuosità? E' un

concetto che ha a che fare con la forma? " Fino a Palombella rossa...".

(ISST cs050)

So, Ghezzi, what does “non cinema” mean? “…To be clear, Moretti

could shoot all his life but he would never reach the cinematic sensuality or

fatality of a Michael Cimino…” Sensuality? Is it a concept that has to do with

form? “Till Palombella rossa…”.



Finally, when dealing with news articles, the article itself represents a content

whose source is the writer. The content of the article is in fact the responsibility of its

author, who is usually explicitly mentioned (Figure F).

Figure F - Newspaper article source

http://www-1.unipv.it/webbio/labweb/primantr/news/genetre2_giornale.gif

3.1.3 Elements Functioning as Cue

How is it possible to detect the existence of an attribution relation? The simple

juxtaposition of a source and a content together would not be enough unless some

other element provides the textual anchor that links them together. This element is

the attribution cue and it can be realised by different linguistic elements. These can simply

be graphic elements, the use of punctuation, or grammatical and lexical devices.

Apart from establishing the relation, the cue has also another function: it

determines the kind of attribution e.g. a belief, a thought, an assertion, etc. While

punctuation cues always refer to asserted contents and prepositions alone do not

specify the nature of the relation, nouns and verbs can express several types of

attribution.



Punctuation Cues

Punctuation, i.e. double and single quotation marks (‘…’, “…”) and less frequently

double angle brackets (<<…>>) or hyphens (-…-), represents, in Italian as well as

in English, the simplest cue to look for when searching for an attribution, although

it is not frequently the only one (46). However, this is not a reliable cue as it only

accounts for the attribution of assertions directly reported, leaving out indirect

speech and also the attribution of mental states such as opinions, intentions or

knowledge. Moreover, the same punctuation marks may as well be employed in

Italian to mention a word or a title, to signal an unusual usage (47) of one or a few

words such as an ironic or metaphoric use, or even to give emphasis to them. In

addition to that, single quotation marks are also used for the apostrophe and in

some cases, in order to avoid using special characters, to render accented glyphs.

(46) Il Papa: “La cultura ha bisogno del genio femminile”. (ISST cs014)

The Pope: “Culture needs the female genius”.

(47) Settembre, mese tradizionalmente <<caldo>>, non fa registrare vistosi

strappi al rialzo, sottolineando l’andamento verso il basso del costo della

vita. (ISST els020)

September, a traditionally <<hot>> month, does not register any conspicuous

upward jumps, underlining the downward trend in the cost of

living.
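
The limits of punctuation as a cue can be illustrated with a naive extractor of quoted spans. The Python sketch below is an assumption-laden illustration, not a component of the annotation: the pattern and the function name `quoted_spans` are hypothetical, guillemets are included only as a guess about possible encodings, and single quotation marks and hyphens are deliberately left out because of the apostrophe ambiguity discussed above.

```python
import re

# Quotation devices considered by this sketch (an illustrative selection).
QUOTE_PATTERN = re.compile(
    r'"([^"]+)"'      # straight double quotes
    r"|“([^”]+)”"     # curly double quotes
    r"|<<(.+?)>>"     # double angle brackets, as in the ISST examples
    r"|«([^»]+)»"     # guillemets (assumed alternative encoding)
)


def quoted_spans(text):
    """Return the text of every span enclosed by one of the devices above."""
    spans = []
    for match in QUOTE_PATTERN.finditer(text):
        # Exactly one capture group participates in each match.
        spans.append(next(g for g in match.groups() if g is not None))
    return spans


ex46 = "Il Papa: “La cultura ha bisogno del genio femminile”."
ex47 = ("Settembre, mese tradizionalmente <<caldo>>, "
        "non fa registrare vistosi strappi al rialzo")

print(quoted_spans(ex46))  # a genuine attributed assertion
print(quoted_spans(ex47))  # an unusual usage, not an attribution
```

Both examples yield a quoted span, although only the first is an attribution: the extractor alone cannot tell (46) from (47), which is the unreliability described above.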

Preposition and Prepositional Groups

Syntactic cues can be expressed by several word classes. Although attribution

verbs are by far the most common signal of the existence of an attribution

relation, nouns, adjectives, prepositions and adverbs can also function as cues.

While only one cue is required, it is common to find two or even more cues

combined together (48). A partial account of Italian cues, although only relative to

reported speech, can be found in Renzi (1995). In this grammar the prepositions

‘per’ (‘for’) and ‘secondo’ (‘according to’) (48) are listed. To these should be

added, although they are not very frequent and some do not even occur in the

ISST corpus, the prepositional groups ‘a detta di’ (according to), ‘a parere di’ (in

the opinion of), ‘agli occhi di’ (in the eyes of), ‘nell’ottica di’ (from the perspective of),

‘per quanto riguarda’ (as far as … is concerned), ‘stando a’ (according to) (49).

(48) Non solo, ma secondo lo stesso Tronchetti Provera “da fornitore di cavi

siamo diventati fornitori di sistemi integrati”. (ISST re062)

Not only that, but according to Tronchetti Provera himself “from a supplier of

cables we have become suppliers of integrated systems”.

(49) Oltre ai missili di questo tipo, stando alle stesse fonti, le navi partite dalla

Corea del Nord ne trasporterebbero:COND altri del tipo Styx... (ISST

els075)

Besides this kind of missile, according to the same sources, the ships that

left from North Korea, (apparently) transport other ones of the Styx kind…
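
A lexicon of prepositional cues lends itself to simple pattern matching. The following Python sketch is a hedged illustration only: the names `CUE_PATTERNS` and `find_prepositional_cues` are hypothetical, the handling of articulated prepositions (e.g. ‘alle’, ‘della’) is an addition of this sketch, and ‘per’ is left out because, as a general-purpose preposition, it would match far too often. Note also that ‘secondo’ coincides with the ordinal ‘second’, so matches are candidate cues rather than confirmed ones.

```python
import re

# Articulated forms of the prepositions 'a' and 'di' (al, alle, del, della, ...).
A = r"a(?:l|llo|lla|i|gli|lle|ll['’])?"
DI = r"d(?:i|el|ello|ella|ei|egli|elle|ell['’])"

# Candidate prepositional cues listed in the text (without 'per').
CUE_PATTERNS = {
    "secondo": r"\bsecondo\b",
    "a detta di": rf"\ba detta {DI}\b",
    "a parere di": rf"\ba parere {DI}\b",
    "agli occhi di": rf"\bagli occhi {DI}\b",
    "nell'ottica di": rf"\bnell['’]ottica {DI}\b",
    "per quanto riguarda": r"\bper quanto riguarda\b",
    "stando a": rf"\bstando {A}\b",
}
COMPILED = {label: re.compile(pattern, re.IGNORECASE)
            for label, pattern in CUE_PATTERNS.items()}


def find_prepositional_cues(text):
    """Return (cue label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in COMPILED.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, matched) for _, label, matched in sorted(hits)]


ex48 = ("Non solo, ma secondo lo stesso Tronchetti Provera “da fornitore di "
        "cavi siamo diventati fornitori di sistemi integrati”.")
ex49 = "Oltre ai missili di questo tipo, stando alle stesse fonti, le navi partite dalla Corea del Nord"

print(find_prepositional_cues(ex48))
print(find_prepositional_cues(ex49))
```

In (48) the prepositional cue co-occurs with quotation marks, so a matcher of this kind would ideally be combined with other cue types rather than used in isolation.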

Adverbials

Prasad and Miltsakaki et al. (2008: 43) identify some adverbials which may

function in English as attribution cues, such as ‘reportedly’ (50), ‘allegedly’,

‘supposedly’, etc. In Italian, however, there is no corresponding class. These

adverbials usually have an equivalent in a prepositional phrase: ‘a quanto si dice’

(according to what one says).

(50) East Germans rallied as officials reportedly sought Honecker’s ouster.

(PDTB 2278)

Nouns and Adjectives

While discussing the way the source can be expressed (3.1.1) it has been shown

how adjectives can assume this function. Adjectives establishing a relation of

possession between possessor and owned entity function as cues of an attribution

relation if the possessor is the source and the owned entity the content, or the

element coreferential to the content, as in the example below (51).

(51) “The Defence Minister resigned today”. The presidential announcement at

the press conference came unexpected.



Although nouns alone do not establish any relation between source and content,

they can function as ‘introductory elements’ (Renzi, 1995) following or preceding

the attributed material they represent. These nouns or NPs are very informative

about the typology of attribution, e.g. assertion (declaration, release, observation,

etc.), belief (doubt, idea, etc.) or intention (agreement, promise, desire, etc.).

Knowing the type of attribution is very relevant in order to discern if the attributed

material is for example an opinion, a statement or an intention. In the following

example (52), ‘la dichiarazione’ (the declaration) is the only element signalling that

the following material (highlighted in italics) is not attributed to the writer but to

another source, which is not at all mentioned. The noun itself, representing a

speech act, presupposes the existence of a source.

(52) Mi ha sconvolto la dichiarazione che tutto questo non vale niente. (Renzi,

1995:435)

The declaration that all this is worth nothing upset me.

Renzi (1995) also observes that nouns or NPs functioning as attribution cues

usually have an argument structure or refer to speech acts, but also to acts of

e.g. thought (53) or will.

(53) “It is nice to die for what you believe in; who is afraid, dies every day, who is

not afraid, dies only once”. With this idea the anti-Mafia magistrate Paolo

Borsellino worked till he was assassinated in 1992.

Grammatical Cues: Quotative Conditional

Some languages grammatically mark the fact that the writer/ speaker is not directly

presenting the information but there is an intermediary source. This grammatical

category is called evidentiality, as it accounts for “the evidence a speaker has for

his/ her statement” (De Haan, 2008:77). As the WALS map of the Semantic

distinction of evidentiality (De Haan, 2008:77) shows, the encoding of evidentiality

is a relatively common feature. In Europe, however, it is almost only indirect

evidentiality that can be expressed, without further distinguishing among different

modes of sensory evidence.



De Haan (2008) points out the fact that the languages presenting indirect

evidentials in Europe are mainly Germanic, with the exclusion of English, and

suggests that Finnish and French may have developed evidentiality because of

Germanic influence, as Ugro-Finnic and Romance languages do not present this

feature.

However, this is not entirely accurate, as Italian possesses a grammatical structure to

express hearsay, i.e. the quotative conditional (54), similar to the French

“conditionnel de la rumeur”. Neither language, however, has a dedicated

grammatical category for evidentiality, as the conditional is also used for other

purposes, e.g. unreality, attenuated wish, etc., expressing a number of factuality

degrees and epistemic modality.

Epistemic modality is often associated with evidentiality, as the information

source influences the degree of certainty the speaker expresses towards a

proposition. Although epistemic modality may be intertwined with the conditional, this

Italian mood is reportive and not inferential (Giacalone, 2007).

(54) Un incendio, che si sarebbe sviluppato:COND per cause accidentali, ha

gravemente danneggiato a Fiano (Torino), uno chalet di proprietà di

Umberto Agnelli, attiguo alla sua abitazione. (ISST cs010)

A fire, which (is said to have) developed for accidental causes, has severely

damaged in Fiano (Turin), a chalet belonging to Umberto Agnelli, next to his

residence.

According to Aikhenvald (2004) languages like Italian and French do not have

evidentiality as they do not have dedicated morphemes expressing it, but just

“evidentiality strategies” which originate from the verb mood and represent a

secondary function.

Nonetheless, quotative conditionals are very common in Italian and are an

important indicator of attribution, although, as Knott (1996) also remarks, they can

only be recognised in context, as exemplified in (55a-b). Moreover, rather than

attributing the content to a source, quotative conditionals mark that the default

attribution to the writer is not suitable. They always refer to an indeterminate,

unknown source, unless this is explicitly expressed by other means ((49), (56)).



(55) a. Il presidente sarebbe:COND morto.

The president is (said to be) dead.

b. Il presidente sarebbe:COND morto, se non avesse usato la cintura.

The president would have died/ would be dead, if he had not used

the seatbelt.

(56) Secondo anticipazioni l’esame del Consiglio di Stato avrebbe avuto:COND

un esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai

primi di giugno. (ISST sole153)

According to anticipations, (it seems that) the Council of State examination

had a positive result and the regulation should get the starting signal the

first days in June.

Verb cues

Verbs are the most significant attribution cue in Italian as well as English. When

occurring at the intra-sentential level, they usually constitute the main clause

together with the source, while the content is expressed by a dependent clause

with (57) or without (58) the complementizer ‘che’ (that). The attribution clause

may not only occur before or after the content text span, but also enclosed in it as

an incidental clause, or even, although it is not a very frequent strategy, around the

content (e.g. Giovanni: “Tutto qui?” chiese con un sorriso./ John: “Is that all?” (he)

asked with a smile.)

Renzi (1995) groups these verbs in three categories: A) verbs expressing a

linguistic action, e.g. ‘raccontare’ (to tell), ‘telefonare’ (to phone), ‘rispondere’ (to

answer), ‘scrivere’ (to write), ‘ordinare’ (to order) etc.; B) verbs expressing the

reception of a linguistic act, e.g. ‘sentire’ (to hear), ‘intendere’ (to understand),

‘leggere’ (to read), etc.; C) verbs conveying a cognitive process, e.g. ‘pensare’ (to

think), ‘ricordare’ (to remember), etc.

The PDTB adopts instead a different and more fine-grained classification

(see 2.4.3). Assertions and eventualities partly overlap with A), facts should

correspond to B), and beliefs to C).



(57) Nella morte di Ivan Ilic, Tolstoj sostiene che in quel momento si va verso

una grande luce. (ISST els034)

In the death of Ivan Ilic, Tolstoj claims that in that moment we go towards a

big light.

(58) The BPC Fine Arts Committee think she had a literal green thumb. (PDTB

0984)

Another category of verbs that does not match any of the above-mentioned ones

can also be found as attribution cue. As they cannot themselves function as

introductory devices for the content, Renzi (1995) suggests that these verbs

should be considered as implicitly presupposing one of the attribution verbs,

probably a hyperonym such as say or think. These verbs can be ascribed to two

different categories.

One includes verbs such as ‘iniziare’ (to begin), ‘continuare’ (to continue),

‘aggiungere’ (to add) (59) and ‘concludere’ (to conclude) which suggest the

existence of another attribution they usually follow, but may also precede.

Therefore these verbs could be considered as inheriting the type from the verb

they are linked to, which is usually an assertion, as they correspond to the

chronological phases of a speech event.

(59) “…Finché c’è chi lo difende e lo incoraggia, lui continuerà a comportarsi

così”, profetizza Storace. E aggiunge: “Ancora più grave poi è

l’atteggiamento del governo, che non prende posizione davanti alle

stronzate di Bossi perché la sua sopravvivenza dipende dai voti della Lega”.

(ISST cs027)

“…Till there is someone protecting and encouraging him, he will go on

behaving like that”, forecasts Storace. And (he) adds: “Even worse is the

attitude of the government, which does not take a stand against Bossi’s

absurdities because its survival depends on the votes of the Lega”.

The other includes verbs such as ‘sorridere’ (to smile) (60), ‘alzare le spalle’ (to

shrug the shoulders), ‘adombrarsi’ (to grow dark) (61), ‘rallegrarsi’ (to rejoice),



‘acquietarsi’ (to calm) etc. These verbs occur mainly in incidental position (Renzi,

1995). Most of them are part of what Levin (1993:219-220) classifies as verbs of

nonverbal expression and of gestures, observing that they are usually associated

with an emotion and mainly involve a facial expression or body parts, e.g. ‘annuire’

(to nod), ‘ammiccare’ (to wink), ‘corrugare’ (to wrinkle), etc. The rest of the verbs in

this group directly refer to an emotional change, often involving a change in the

intonation.

Talmy (2000:152) defines manner as “a subsidiary action or state that a

Patient manifests concurrently with its main action or state”. Therefore manner is

expressed in languages that cannot normally express it on the verb (e.g. Italian) as

two sub-events. As these attribution verbs express the manner of the verbs

conveying the attribution, this could be considered a sort of metonymical use

(manner for the action/ manner). Like the continuative verbs, these

verbs are usually associated with speech acts, therefore they could be seen as

specifying the hyperonym ‘say’ which is left implicit. In the following example the

verb used as cue, ‘sorride’ (smile), could be substituted with ‘dice sorridendo’ (says

while smiling).

(60) Arlacchi sorride: “Pura paranoia politica. Non ho partecipato ai lavori solo a

causa di un impegno privato…”. (ISST re095)

Arlacchi smiles: “Pure political paranoia. I didn’t participate in the works

only because of a private appointment…” .

(61) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction?

"Sì - si adombra Matt - Un ruolo interessante: con Tarantino eravamo a

buon punto, poi é arrivato Bruce. I suoi film incassano un po' più dei miei,

no? Hanno scelto lui", ride nervoso, tormentando il tappo a vite di una

bottiglia d'acqua minerale . (ISST cs060)

Is it true that you were going to play the role of Bruce Willis in Pulp Fiction?

“Yes - Matt grows dark - An interesting role: with Tarantino we were at a

good point, then Bruce arrived. His films gross a bit more than mine,

right? They chose him”, (he) laughs nervously, tormenting the screw top of

a mineral water bottle.



3.2 Some Issues

The annotation of attribution relations raises several questions about how to deal with

peculiar aspects or issues of attribution which represent a challenge to the

annotation. These aspects arose from theoretical considerations as well as while

performing the pilot annotation and are particularly important as they determine the

choice of a suitable tool and shape the annotation schema. In this chapter some of

these features will be presented.

3.2.1 Nested Attributions

A pervasive characteristic of attribution is its recursiveness. Any attribution relation

can constitute the content of another attribution relation and this the content of

another one and so forth. The possibility of nesting an attribution into another

attribution is a potentially never-ending process. This could be exemplified as

follows, the capital letter representing the source and the brackets signalled by the

same small letter its content.

A [B {C (D |…|d )c }b ]a

Although not annotated in the PDTB, nested attributions need to be accounted for

in order to determine the truth or trustworthiness value of the embedded content.

Considering just the shallowest source, that is the leftmost, or the most embedded

one, hence the rightmost, or even an arbitrary intermediate one, could lead to

ignoring characteristics of the other ones which would possibly determine

an important re-reading of the information in the content. Wiebe (2002; Wiebe et al., 2005)

includes nested sources in the annotation schema, listing in the source ID all

the sources in the sentence, with the addition of the writer, that include a certain

text span in their content.

(62) [Sue said {that Mary believes (that Gore won the election)}].

Sources: [writer] {writer, Sue} (writer, Sue, Mary)

(Wiebe, 2002:5 - with the addition of brackets)
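The source lists in (62) follow a regular pattern: each level of embedding inherits every source above it, with the writer as the implicit outermost one. The following is only a minimal Python sketch of that pattern; the function name `nested_sources` and the representation of sources as plain strings are assumptions made for illustration and are not part of Wiebe's schema.

```python
from typing import List, Tuple


def nested_sources(chain: List[str]) -> List[Tuple[str, ...]]:
    """Return, for each level of embedding, the sources that scope over it.

    `chain` lists the explicit sources from the outermost to the innermost.
    The writer is always added as the implicit outermost source.
    """
    sources = ["writer"] + chain
    return [tuple(sources[: level + 1]) for level in range(len(sources))]


# Example (62): Sue said that Mary believes that Gore won the election.
spans = ["[Sue said ...]", "{Mary believes ...}", "(Gore won the election)"]
for span, srcs in zip(spans, nested_sources(["Sue", "Mary"])):
    print(span, "->", ", ".join(srcs))
```

Each deeper span simply extends the list of the span containing it, which is why the most embedded content carries the longest list of sources.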

Formalising the effect the sources determine on the different embedded contents

would allow, once attribution relations have been recognised, the automatic

derivation of the truth value of information at different levels of embedding. This

represents a simplistic abstraction, as sources are almost never so sharply divided

into ‘sincere’ and ‘liar’ but usually imply different degrees of reliability or bias they

project onto the content. These also vary according to the content topic, as the

source expertise also varies.

Making use of Boolean logic it is however possible to draw some

considerations. Figure G represents a possible scheme of nested attributions.

Source ‘A’ is related to the content ‘a’ through the attitude it holds towards it (belief,

statement, desire, etc…). The content ‘a’ is formed by the relation ‘Bb’ occurring

between the source ‘B’ and its content ‘b’, which in turn is composed by ‘Cc’ and

so forth, plus optional additional material which is not part of the relation. The

trustworthiness and knowledge of the source ‘A’ determines the truth and reliability

of its content ‘a’ as in a relation of implication ‘A→a’. Substituting ‘a’ with its

correspondent ‘Bb’ the implication becomes ‘A→Bb’ that is ‘A’ implies the

attribution relation embedded in it. If ‘A’ is trustworthy, also the attribution relation

nested in it ‘Bb’ is and should be considered factual. Similarly every source implies

its content and therefore the attribution relation included in it: ‘B→b’, ‘C→c’, ‘D→d’,

‘N→n’.

Figure G - Nested attribution schema

However, when deriving the ‘truth’ value of a content ‘d’ all the sources of the

contents it is included in need to be considered. It is not sufficient that ‘D’ is

considered reliable: all sources to its left (i.e. A, B, C) need to be as well (Figure Ha). They

can therefore be joined with an AND relation (A ∧ B ∧ C ∧ D) → d. To make it

simple, sources and contents which are reliable and taken into consideration are

here labelled as ‘true’, while sources and contents that are not, as ‘false’. Figure H

shows the ‘truth’ values (T/ F) of a nested content ‘d’, the arrows point to the

attribution relation between the content (small letter) and the source (capital letter).


Proceeding from the inside to the outside, in case ‘D’ is ‘false’ (Figure Hb), ‘d’ is

already uncertain, and it is not necessary to also check ‘A’, ‘B’ and ‘C’.

Considering the example (63) and supposing an answer to the question ‘Is John

innocent?’ is required, probably his mother should not be considered as a reliable

source. In that case, the piece of information representing the most embedded

content is not relevant and cannot be considered as the answer. Did she really

make such a declaration? If ‘The Times’ and ‘the police’ are considered as reliable

sources, and also in this case this is an arbitrary decision, then it should be

assumed that this attribution relation is correct.

(63) The Times writes about the police saying that the murderer’s mother

declared: “John is innocent”.

Moreover, a ‘false’ source implies that everything to its right, therefore towards the

more embedded attribution relations, cannot be trusted even if the other sources

to the right are ‘T’. Figure Hc shows a case in which ‘A’ is ‘true’ and so is its

content ‘a’ and therefore the attribution of ‘b’ to ‘B’. Since ‘B’ is ‘false’, however, the

content ‘b’ and everything contained in it cannot be considered ‘true’.
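The Boolean reading described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `content_is_trusted`, the dictionaries of reliability judgements and the three cases are hypothetical, chosen to mirror the situations discussed for Figure H, and the judgement of which source counts as ‘true’ is taken as given.

```python
from typing import Dict, List


def content_is_trusted(chain: List[str], reliable: Dict[str, bool], source: str) -> bool:
    """Tell whether the content of `source` can be labelled 'true'.

    `chain` lists the sources from the outermost to the most embedded one.
    The content of a source is 'true' only if that source and every source
    to its left are reliable, i.e. the conjunction of their values holds.
    """
    depth = chain.index(source) + 1
    return all(reliable[s] for s in chain[:depth])


chain = ["A", "B", "C", "D"]
case_a = {"A": True, "B": True, "C": True, "D": True}   # every source reliable
case_b = {"A": True, "B": True, "C": True, "D": False}  # innermost source unreliable
case_c = {"A": True, "B": False, "C": True, "D": True}  # intermediate source unreliable

print(content_is_trusted(chain, case_a, "D"))  # content 'd' with all sources reliable
print(content_is_trusted(chain, case_b, "D"))  # content 'd' when 'D' is unreliable
print(content_is_trusted(chain, case_c, "A"))  # content 'a', i.e. the attribution of 'b' to 'B'
print(content_is_trusted(chain, case_c, "D"))  # content 'd' when 'B' is unreliable
```

The sketch checks the sources from the outside inwards, while the text proceeds from the inside outwards; for a conjunction the result is the same, since a single ‘false’ source is enough to make the embedded content unusable.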

Figure H - Truth values of a nested content (panels a, b and c)

Discerning between sources which should be considered and sources which

should lead to the rejection of the content depends on subjective and domain

specific considerations. Once the sources have been sorted, determining if the

content is to be taken into consideration can potentially be turned into a Boolean

problem, as presented above. As already anticipated, however, determining the

relevance of a piece of information is not a Boolean problem as it involves

variables with domain sizes greater than the binary ‘true/ false’.

That is, sources are almost never completely reliable or completely

unreliable, but they occupy intermediate positions on a continuum between the

‘true’ and ‘false’ poles according to the field of information under consideration and

personal orientations and subjective characteristics of both the source and the

person considering the information. Algorithms for non-Boolean problems should

be more appropriately employed to fully deal with the degree of truth of the

content. This is better captured by fuzzy logic as the sources and the content have

a truth value ranging between 0 and 1.
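A graded version of the same conjunction can be sketched as follows. This is a hedged illustration only: the text does not specify which fuzzy operator should replace the Boolean AND, so the two common choices shown here (minimum and product) are assumptions, as are the function names and the reliability degrees, which are invented for the example.

```python
import math
from typing import Sequence


def _check(degrees: Sequence[float]) -> None:
    """Reject empty input and values outside the interval [0, 1]."""
    if not degrees:
        raise ValueError("at least one source is required")
    for d in degrees:
        if not 0.0 <= d <= 1.0:
            raise ValueError(f"degree out of range: {d}")


def fuzzy_and_min(degrees: Sequence[float]) -> float:
    """Minimum operator: the content is as reliable as its weakest source."""
    _check(degrees)
    return min(degrees)


def fuzzy_and_product(degrees: Sequence[float]) -> float:
    """Product operator: every additional source lowers the degree further."""
    _check(degrees)
    return math.prod(degrees)


# Hypothetical reliability degrees for three nested sources, outermost first.
degrees = [0.9, 0.4, 0.8]
print(fuzzy_and_min(degrees))
print(fuzzy_and_product(degrees))
```

With the product operator each further level of nesting can only lower the value, whereas with the minimum operator only the least reliable source matters; which behaviour is preferable remains an open modelling choice.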

Example (64), taken from real language use, presents four levels of nested

attributions. First from the outside, the writer of the article, or more generally the

newspaper publishing it of which the whole sentence represents a content. Second

the ‘New York Times’ which is reporting rumours, the third source, and last

‘Blinder’, holding the most internal content. None of these four sources needs to

be a priori discarded, although ‘rumours’ is surely less reliable as it has a non-

specific referent.

(64) Blinder, secondo voci riferite dal New York Times, sperava di succedere

al presidente Greenspan quando a marzo scadrà la sua nomina. (ISST

re070)

Blinder, according to rumours reported by the New York Times, hoped to

succeed president Greenspan when his appointment expires in March.

The more sources there are, the more passages a piece of information has gone

through, and therefore the greater the chance that it has been transformed from its

original form, as in the well-known ‘Whisper Game’. Although not so common and usually

shallower than the nesting of an attribution relation in the content of another

attribution relation, the example (64) above shows that the source may also be an

attribution relation itself. ‘Rumours reported by the New York Times’ is in fact the

source of the content ‘Blinder…hoped to (…)’.

Conversely, this could also be interpreted and analysed as a different

perspective on the same problem. A source can present an attribution

relation in its content and the source of a nested attribution relation can be in turn

an attribution relation whose content is its local source and source the

superordinate one (e.g. Blinder, according to rumours reported by the New York

Times…/ The New York Times reports rumours saying that Blinder…).

3.2.2 Source of the Source

What is here called ‘source of the source’ could be considered a special case of

nesting. In this case, the attribution relation makes explicit the presence of another

source which is not on the same level of embedding as the actual source; if it

were, it would just represent an instance of multiple sources (see 3.2.3). This added

source could be more internal as in the example (65), the ‘source of the source’ is

in small capitals, where the most embedded source ‘Maurizio Damilano’ is not

directly connected to the content by an attribution relation although this is

semantically inferable.

This type of additional source is usually dependent on verbs of perception

and knowledge, the ones labelled as ‘facts’ in the PDTB, as these correspond to

the verbs representing the reception of a linguistic act (Renzi, 1995), therefore

they more or less implicitly recall the production of a linguistic act. The sentence in

(65) could be in fact transformed into its reciprocal equivalent ‘Maurizio Damilano

told me about the disqualification of Garciano,…’, the original source becoming the



indirect object, the recipient of the speech act, while the ‘source of the source’,

signalling the provenance of the information, becomes the new source.

(65) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro,

non pensavo di arrivare primo. (ISST cs071)

(I) heard of the disqualification of Garciano FROM MAURIZIO DAMILANO, I

swear, I didn’t think I would come first.

Both recipient and ‘source of the source’ are relevant to the attribution relation as

they inform about the source of a piece of information and the entity this was

addressed to. Both these elements can influence the way the content is perceived

(66a-b).

(66) a. The pope/ scientist says we do not derive from monkeys.

b. The scientist told THE PRESIDENT/ THE SCHOOLCHILDREN that asbestos is

harmful.

Instances where a more external source is mentioned without directly relating it

to its whole content, but just to the source embedded in it, as in ((67)-(68)), could

also be considered ‘source of the source’. This strategy is often adopted when

the intermediary source is not particularly prominent and is not expressing

anything other than reporting the most embedded content. Example (68) could be

paraphrased as ‘The president’s spokesman Rossi said that the president

announced that a new anti-Mafia pool has been appointed’. Making the attribution

relation involving the ‘source of the source’ explicit, the spokesman would become

the subject of the sentence and therefore occupy a prominent role attracting

unneeded attention and diverting it from the more prominent source ‘the

president’.

(67) Poi però, TRAMITE LA FIGLIA che sta a Santiago, prima limita la portata del

colloquio con Gaston Salvatore (“non è stata una vera intervista, solo una

conversazione”), poi smentisce. (ISST period005)

Afterwards however, THROUGH THE DAUGHTER who lives in Santiago, first

(she) plays down the importance of the talk with Gaston Salvatore (“it

wasn’t a real interview, just a conversation”), then she denies it.

(68) The president has announced THROUGH HIS SPOKESMAN ROSSI that a new

anti-Mafia pool has been appointed.

This second type of ‘source of the source’ is expressed as an adjunct indicating

means. In that position such sources are also presented as less relevant, as if they were

neutral and not affecting the content. However, both types should be included in

the annotation as they do inform about the fact that the attribution relation is

second hand material and they surely need to be considered when computing the

disturbing effect of the ‘Whisper Game’ as they add a level of nesting to the

attribution.

3.2.3 Multiple Sources, Contents, Cues

The main elements involved in an attribution relation are three; however, more

than one of each can be involved in the relation at a time. The most common

case is that of a source holding the same attitude towards more than one content,

as in the examples ((69), (72)).

(69) (Ø) Ho detto |che ero dalla sua parte| e |che ritenevo giusta la sua protesta|.

(ISST cs063)

(I) said |that I was on his side| and |that I considered his complaint fair|.

Also quite common is the presence of more than one mentioned source (70). This

is different from collective sources, such as institutions, organisations, pluralities or

groups as multiple sources are separate entities or at least are presented as such,

e.g. John and Mary; the government, the army, and the civilians, etc. Often, as

in the example below, one source semantically includes the other one, which

represents a specification of the more general source. Multiple sources are more

common when expressing beliefs or knowledge, as assertions or even opinions

usually belong to a single entity or to an entity presented as unanimous.



(70) Tutti, incluse le autorità, conoscono la loro provenienza, ma nessuno dice

e fa nulla per prevenire il massacro di capi selvatici. (cs.morph020)

Everyone, including the authorities, knows their provenance, but no one

says and does anything to prevent the massacre of wild animals.

Lastly, the cue itself or the attitude the source is holding towards the content can

be multiple. Often an attribution relation is signalled by several strategies, e.g.

According to what John suggests, “the market is not ready yet”; however, they do

not interfere as they are all conveying that the content is a statement or a belief or

another kind of attribution. When cues represent instead two separate attitudes a

source holds towards the same content, as in the example (72), this could be

considered a multiple cue. In (71) instead both verb cues refer to linguistic

productions and could be grouped together. Multiple cues are not very common;

more frequently the presence of two different cues does not reflect a different

attitude but an evaluation of the content that the writer expresses, suggesting in a

way the key to interpret the utterance, as in (73), where a speech act directly

reported is bound to a cue labelling it as an opinion.

(71) … <<domani questa stessa gente é pronta a scendere in piazza per

rivendicare>> dicono e scrivono in molti. (ISST els063)

…<<tomorrow the same people are ready to take to the streets to claim>>

many say and write.

(72) The men can defeat immunities that states often assert in court by showing

that officials knew or should have known |that design of the structure was

defective| and |that they failed to make reasonable changes|. (PDTB 1160)

(73) “The journalists shouldn’t morbidly write about people’s sorrow” thinks

Mary.

3.2.4 Co-reference Resolution

Since source and content are often recalled by a pronoun or a coreferential

element, co-reference resolution becomes a fundamental issue when dealing with

attribution relations. The manual annotation could simply mark the coreferential

text span; the automatic capturing of the phenomenon, however, would require

the resolution of anaphora and co-reference relations. Research in this area is

progressing, but a tool able to resolve the kind of co-references involved in

attribution is still lacking.

Co-reference regarding the source is usually either bridging, e.g. El

Sayed….l’arabo…/El Sayed…the Arab… (ISST els001), or pronominal

anaphora. Source anaphora often involves pronouns (74) recalling full nouns or

NPs, but also, in Italian, Ø subjects (75). The coreferential source or content is

presented in the examples below in small capitals.

(74) Secondo il governo di Pechino, le accuse in base alle quali due

diplomatici cinesi sono stati espulsi la settimana scorsa dagli Stati Uniti,

sono una montatura. LO ha detto ieri un portavoce del ministero degli

Esteri, IL QUALE ha anche annunciato che il governo cinese ha protestato

con quello degli Stati Uniti e che si riserva il diritto di ulteriori reazioni. (ISST

els075)

According to the Beijing Government, the charges on the basis of which two

Chinese diplomats were expelled last week from the United States are

a frame-up. IT was said yesterday by a spokesman of the Foreign Ministry,

WHO has also announced that the Chinese government has complained to

that of the United States and that it reserves the right to

further reactions.

(75) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita

allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E

Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare

molto il comportamento dei commentatori sulle emittenti di Berlusconi.

(ISST cs043)

Probably Vialli has not forgotten the rumours about his presumed ‘happy

life’ during the 1990 World Cup revealed on Italia1 by Maurizio Mosca. And

(HE) doesn’t believe that the recent alliance Juventus-Milan could really

change the commentators’ behaviour on Berlusconi’s televisions.



The content is instead usually formed by clauses or sentences recalled by a

pronoun (74), but also by a noun of which it represents an elaboration (see 3.1.2), as

in (76), where ‘words’ refers back to the whole direct quotation. Example (74)

contains three attribution relations, of which two involve co-reference. The first

sentence/ attribution is in fact attributed to ‘a spokesman of the Foreign Ministry’

by recalling it with the personal pronoun ‘it’. The first attribution is nested in the

second not, as usual, by being inside its content span (it is in fact in a separate

sentence), but because of the event anaphora relating it to the content of the

second attribution. The source, ‘a spokesman of the Foreign Ministry’, is afterwards

recalled by the relative pronoun ‘who’ and becomes part of the last attribution

relation.


(76) … distruzione di tutti gli armamenti nucleari.” LE PAROLE registrate di Gheddafi,

…(ISST cs039)

… nuclear armaments.” Gheddafi’s recorded WORDS,…

While still challenging, anaphoric expressions such as pronouns have been deeply

investigated and some studies are also analysing event co-reference, which is

closely related to studies about temporality and time references. The co-

references included in attribution relations partly overlap with both research areas

as the source falls in the first group, i.e. bridging and pronominal anaphora, while

the content is partly of interest of the second one, i.e. event anaphora.

The resolution of co-reference is crucial in order to allow retrieving the

specific provenance of information as pronouns alone do not carry information

about reliability, expertise or bias of the source. Similarly, it is necessary to be able

to establish a relation between the source and what it has actually said, thought,

dreamt of, etc... In a sentence like ‘John has an idea’, linking ‘John’ to ‘idea’ is not

informative and would be of no use unless we can retrieve what John’s idea was.

As attribution is part of a bidirectional relation, not only linking linguistic

material to the entity expressing it but also entities to what they express,

co-reference also needs to point in both directions. Only a co-reference tool able

to account for this bidirectionality would allow, in the example (76), once the

material in quotes has been retrieved, to realise that this is coreferential to a NP

which is part of an attribution relation from which it should therefore inherit the

source. On the other hand, if the task is retrieving Gheddafi’s declarations, ‘words’

as such, although attributed to him, is not what he said and it should be possible to

retrieve the coreferential quotation it stands for.
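The two directions of lookup discussed for (76) can be made concrete with a small Python sketch. It is a hypothetical data structure, not a description of an existing co-reference tool: the class names `Span`, `Attribution` and `AttributionIndex`, the identifiers `q1` and `np1`, and the placeholder standing in for the quotation are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    text: str
    corefers_with: Optional[str] = None  # id of the span this one stands for


@dataclass
class Attribution:
    source: str
    cue: str
    content_id: str  # id of the span annotated as content


class AttributionIndex:
    def __init__(self, spans: List[Span], attributions: List[Attribution]) -> None:
        self.spans: Dict[str, Span] = {s.span_id: s for s in spans}
        self.attributions = attributions

    def resolve(self, span_id: str) -> str:
        """Follow co-reference links to the span carrying the actual material."""
        seen = set()
        while self.spans[span_id].corefers_with is not None and span_id not in seen:
            seen.add(span_id)
            span_id = self.spans[span_id].corefers_with
        return span_id

    def source_of(self, span_id: str) -> Optional[str]:
        """Content to source: a quotation inherits the source of its coreferential NP."""
        target = self.resolve(span_id)
        for att in self.attributions:
            if self.resolve(att.content_id) == target:
                return att.source
        return None

    def contents_of(self, source: str) -> List[str]:
        """Source to content: return what was said, not the NP standing for it."""
        return [
            self.spans[self.resolve(att.content_id)].text
            for att in self.attributions
            if att.source == source
        ]


# Hypothetical encoding of (76); the quotation text is a placeholder.
quote = Span("q1", "<direct quotation of example (76)>")
words = Span("np1", "LE PAROLE registrate di Gheddafi", corefers_with="q1")
index = AttributionIndex([quote, words], [Attribution("Gheddafi", "parole", "np1")])

print(index.source_of("q1"))
print(index.contents_of("Gheddafi"))
```

Starting from the quotation, `source_of` reaches the source through the coreferential NP; starting from the source, `contents_of` returns the quotation rather than the noun ‘words’, which is the behaviour the paragraph above asks for.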

3.2.5 Scope Definition

The main challenge for the annotation of discourse phenomena, and annotation in

general, is reaching a precise scope definition, the lack of which would invalidate any

attempt to reach satisfactory interannotator agreement scores. As far as attribution

is concerned, it is important to define what to include in the cue and over which

text span the attribution relation holds.

The content is not always as easily detectable as when it is delimited by

quotes. Sometimes it is expressed by a pronoun or full noun recalling it as

discussed in (3.2.4), other times, due to the ambiguity of language, it is not clear

what exactly is the attributed span and what is possibly just additional material. In

the case of multiple contents, for example, the presence of a conjunction (77) is not

sufficient to assume the second span is also a content, as it often represents some

additional information or even a comment the writer expresses. To be sure that the

second span is also attributed, the subordinator ‘that’ should also be present.

(77) The president said that the economy is on the verge of a severe crisis AND

|he is going to meet the ministers to talk about possible solutions|.

Concerning the source, this is often a noun phrase; however, attributes,

appositives (78) or relative clauses need to be considered as they might be

necessary to the characterisation of the entity they refer to. Other times this

material constitutes a colourful description (79) which does not help identifying the

source referent and would just make the annotation less neat and manageable.

(78) Per il presidente dei deputati progressisti, Luigi Berlinguer, la

maggioranza <<ha fatto una proposta di natura consociativa che abbiamo



rifiutato…>> (ISST sole013)

For the president of the progressive deputies, Luigi Berlinguer, the

majority <<has made an associative proposal that we have

refused…>>

(79) “… Poi stasera torno a Zagabria”, grida Kasim Zdionica, un signore con

una pancia enorme, le ciabatte di gomma e un pugnale infilato nella

cintura. (ISST cs030)

“… Then this evening I’ll go back to Zagreb”, shouts Kasim Zdionica, a

man with a huge belly, rubber slippers and a dagger stuck in his

belt.

The span to be included in the cue itself is also sometimes unclear. Although the

verb, noun, preposition, etc., functioning as textual anchor of the attribution

relation are not difficult to recognise, there might be supplementary information

necessary to the characterisation of the context in which the relation takes place

such as a temporal specification, a reference to the situation or entity (80) the

content refers to and so forth.

(80) PARLANDO DI VERGA, Pirandello scriveva: i siciliani, quasi tutti, hanno

un’istintiva paura della vita. (ISST els034)

WHILE TALKING ABOUT VERGA, Pirandello wrote: Sicilians, almost all of them,

have an instinctive fear of life.

Deciding what to include in the annotation and what to leave out requires taking

into account not only the relevance of each element to the interpretation of the

content, but also the difficulty that including it could cause, as it may make the

task of the annotators more complicated and uncertain.

Suggestions concerning how to deal with this issue are reported in (6.1).



3.3 Summary

In this chapter attribution has been analysed in order to highlight its constitutive

elements and some problematic characteristics it possesses which are of

particular interest for the annotation. Attribution relations can be considered as

being composed of three constitutive elements: the content, representing the

attributed material; the source, which is the entity the content is related to; and the

cue, the textual anchor linking source and content together.
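
The three-element structure summarised above can be illustrated with a small data structure. The following Python sketch is only an illustration and not part of the annotation schema: the class `AttributionRelation`, the helper functions and the use of character offsets as spans are assumptions introduced here, while the sentence is example (85) from the ISST corpus.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical representation: a span is a (start, end) pair of character offsets.
Span = Tuple[int, int]


@dataclass
class AttributionRelation:
    """One attribution relation with its three constitutive elements."""
    content: List[Span]  # attributed material (possibly several spans)
    cue: List[Span]      # textual anchor(s) linking source and content
    source: List[Span] = field(default_factory=list)  # empty if the source is implicit


def span_of(text: str, fragment: str) -> Span:
    """Return the offsets of the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return (start, start + len(fragment))


def covered(text: str, spans: List[Span]) -> List[str]:
    """Return the substrings of `text` selected by `spans`."""
    return [text[start:end] for start, end in spans]


# Example (85) from the ISST corpus.
sentence = "Era di ottimo umore, ricorda Francesco."
relation = AttributionRelation(
    content=[span_of(sentence, "Era di ottimo umore")],
    cue=[span_of(sentence, "ricorda")],
    source=[span_of(sentence, "Francesco")],
)
```

Keeping each element as a list of spans leaves room for implicit sources (an empty list) and for multiple or discontinuous contents.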

Each of these constitutive elements can be expressed by a number of

linguistic structures which make it more difficult to describe the phenomenon in

order e.g. to allow the automatic recognition of it. Sources can be expressed by

proper nouns or pronouns but also be left implicit. The content span can range

from a single word to the entire discourse. The cue is usually a verb, but can also

be a noun, an adjective, an adverb, a preposition or prepositional group, a

graphical device (i.e. punctuation) or even a grammatical one (i.e. quotative

conditional).

To this complex picture it is also necessary to add some features and

problematic issues that need to be considered when developing the annotation

schema. First of all, attribution relations can recursively nest into each other.

Moreover, it can happen that a level of nesting is not made explicit and the relative

source is added as a ‘source of the source’ representing the means through which

a more embedded source expresses the content. Other times a speech event is

presented from the hearer’s perspective, therefore leading to a change in the roles

with the perception of the information being attributed to the hearer and its source

being expressed as a specification of its provenance.

Another issue involves the occurrence of multiple sources, contents and

even cues as part of the same attribution relation. Furthermore, attribution is

heavily intertwined with co-reference and the understanding of attribution relations

is subsequent to the resolution of anaphora and co-references. A last challenge is

determined by the definition of the scope as the text span to include in each of the

three components of attribution needs to be defined so that elements important for

the interpretation of the content or the identification of the source are not left out,

but also without making the annotation too complex or too arbitrary, thus

decreasing the inter-annotator agreement.

4 Features to Include in the Annotation



The annotation of an attribution relation basically requires marking the link between

source and content. However, additional features could also be included in the

annotation which would provide useful information about the nature and veracity of

this relation. These features, or attributes, have been derived from the PDTB

annotation scheme for attribution (Prasad et al., 2007). As presented in (2.4.3), the

scheme includes the attributes of ‘type’, ‘source’, ‘determinacy’ and ‘scopal

polarity’.

After analysing the phenomenon of attribution, however, the PDTB scheme

had to be partially modified and adapted in order to suit the present project. In the

following chapters each feature that has been included in the annotation will be

presented and the values it can assume discussed with the help of examples from

the ISST corpus.

4.1 Type

The feature ‘type’, marking the type of attribution, has been included in the

annotation schema employed for the pilot without any changes from the PDTB.

The type, which is anchored to the cue, determines the kind of attitude the

speaker holds towards the content of the relation. This, as in the PDTB scheme

(Prasad, Miltsakaki et al., 2008), can assume four values: ‘assertion’, ‘fact’, ‘belief’

and ‘eventuality’. The distinction seems quite viable, especially if compared to the

more fine-grained categorisation adopted by Wiebe (2002; Wiebe et al., 2005) for

the annotation of speech events and private states: ‘assertions’ (writing or

speaking), ‘opinions’, ‘beliefs’, ‘thoughts’, ‘feelings’, ‘emotions’, ‘goals’,

‘evaluations’ and ‘judgements’. However, some issues arose that would suggest a

revision of this classification before applying it to the whole corpus.

4.1.1 Assertion

Assertions are conveyed by verbs of communication, e.g. ‘dire’ (to say), ‘affermare’

(to claim), ‘riferire’ (to relate), ‘spiegare’ (81) (to explain), and suggest that the

attribution content has been verbally expressed, in writing (82) or speaking (81).



(81) Ha spiegato Sciandri dopo l’arrivo: “Ho imparato dagli errori del passato,

quando spesso esitavo troppo prima di partire…” (ISST cs082)

Sciandri explained after the arrival: “I’ve learnt from past mistakes, as when

I was hesitating too much before starting”.

(82) L’obiettivo, dice sempre il comunicato dell’Olp, <<è quello di assicurare

una gestione trasparente e altamente professionale delle risorse

palestinesi>>. (ISST sole023)

The goal, says the PLO release, <<is that of guaranteeing a transparent

and highly professional management of the Palestinian resources>>.

4.1.2 Belief

Beliefs are associated with verbs expressing a mental attitude, such as ‘pensare’

(to think), ‘credere’ (83) (to believe), ‘immaginare’ (to imagine). The content in this

case reflects a mental orientation more than conveying an event and it also

expresses a slightly lower level of factuality: while the content of assertions is

presented in a factual way, beliefs bind the content to a point of view (83), an

opinion without pretence of being generally valid.

(83) Ø credo che vivesse nella villa dei Pietroiusti anche d’inverno. (ISST re118)

I think that she was living in Pietroiusti’s villa also in winter.

4.1.3 Fact

Facts are the attributions of the reception of a speech act or of the knowledge of

a piece of information whose truth is not questioned. Cues in this category include verbs

of perception, e.g. ‘sentire’ (84) (to hear), ‘vedere’ (84) (to see), and verbs

expressing knowledge, such as ‘sapere’ (to know), ‘ricordare’ (85) (to recall),

‘rimpiangere’ (to regret).

(84) Ø abbiamo visto e sentito, assieme, un’antica ira e uno stato di grazia.

(ISST re011)



(We) have seen and heard, together, an ancient anger and a

condition of grace.

(85) Era di ottimo umore, ricorda Francesco. (ISST els077)

She was in a very good mood, recalls Francesco.

4.1.4 Eventuality

Eventuality conveys instead an intention the source holds towards the content.

This group is quite heterogeneous and includes, under the label of ‘control verbs’,

these three classes (Sag and Pollard, 1991:65): verbs of the order/ permit type,

with the source trying to influence another agent to perform what is in the content,

e.g. ‘ordinare’ (to order), ‘consentire’ (to allow), ‘proibire’ (86) (to forbid); verbs of

promise, e.g. ‘promettere’ (87) (to promise), ‘accettare’ (to accept), ‘accordarsi’ (to

agree), expressing the commitment of the source towards performing a certain

action; and verbs of the want/ expect type, e.g. ‘desiderare’ (to desire), ‘sperare’

(88) (to wish), ‘volere’ (to want), expressing a mental orientation of the source.

(86) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e

nelle altre località appena riconquistate. (ISST cs030)

And Zagreb authorities have forbidden journalists to go to Petrinja and the

other just reconquered places.

(87) Il governo di Zagabria smentisce seccamente e promette di “punire i

responsabili” se venissero portate delle prove del fatto. (ISST cs031)

The Zagreb government sharply denies and promises to “punish the

responsible people” in case evidence of the deed would be provided.

(88) Gli operatori del mercato fisico sperano che la chiusura americana segni

la fine dell’esplosivo rialzo delle quotazioni. (ISST sole150)

The physical market operators hope that the American close will mark

the end of the explosive rise in prices.



4.1.5 Issues Concerning Type Definition

The definition of the ‘type’ feature presents some problems. First of all, it refers

only to verbal cues, while the textual anchor signalling an attribution relation can

be expressed by different means as listed in (3.1.3), e.g. prepositions, nouns,

punctuation. The latter is employed to report direct speech and can therefore be

interpreted as indicating an ‘assertion-type’ attribution. Nouns are often deverbal,

e.g. ‘suggerire’ > ‘suggerimento’ (suggestion), ‘permettere’ > ‘permesso’

(permission), ‘comunicare’ > ‘comunicato’ (82) (release), and generally easily

referable to the verb they implicitly involve, e.g. ‘pensiero/ idea’ (thought/ idea) >

‘pensare’ (to think), ‘parola’ (word) > ‘dire/ scrivere’ (to say/ write). Prepositions

instead do not explicitly specify the type of attitude the source holds towards the

content. However, it could be argued that they express an opinion, a point of view

( (89), (90)), although derived from an assertion.

(89) …secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che

non intendeva fare nulla di male e che per lui si è trattato di un “gioco”.

(ISST cs004)

…according to indiscretions he would have told the examining magistrates

that he didn’t intend doing anything bad and that for him it was just a

“game”.

(90) Secondo il giornale gli Stati Uniti sperano di siglare un <<memorandum di

intesa>> sul programma <<Sdi>> con Italia, Israele e Giappone entro la fine

del 1986. (ISST els015)

According to the newspaper the United States hope to sign a

<<memorandum of understanding>> concerning the <<Sdi>> program with

Italy, Israel and Japan by the end of 1986.

All types of attribution presuppose however some kind of assertion allowing the

entity reporting the attribution relation to acquire the information. In the example

(91) the content represents the thought of some people; however, it does not

mean that this was acquired through mind-reading techniques. It is implicit that it is

possible to learn about opinions if they are expressed, usually through assertions,



but also using other means of communication, e.g. facial expressions. More

strikingly, this bond connecting assertions and beliefs is clear with self-attributions

as in (92). A speaker or writer wanting to express a personal belief has to assert it.

The source in (92) believes the assertion expressed by the content but at the

same time is saying it. Similarly also wills, intentions, orders, etc., more or less

directly presuppose an assertion.

(91) “…C’è gente che pensa siamo professionisti super pagati e invece la

situazione è molto diversa.” (ISST cs077)

“…There are people who think that we are super paid professionals, whereas

the situation is very different.”

(92) “…Ø credo anche che forse convenga parlarsi tra le parti prima di spedire

lettere”. (ISST re012)

“…(I) also believe that maybe it would be appropriate for the parties to talk

to each other before sending letters”.

On the other hand, assertions quite often reflect what the source is thinking, as in the

two attributions in (93). In the example, what the sources say, in quotes, is also an

expression of their opinion. Less common are attributions like (94) where the

assertion itself is what matters and the content is not an expression of the source’s

thought but just the sequence of words ‘she’ pronounced; that is, the attention is

on the cue rather than on the content. The verbal cue in (93), ‘dicono’ (say), could have

been substituted by the entity reporting these two attributions with ‘pensano’

(think).

(93) “S’é pentita d’aver rotto il silenzio” dicono alcuni. “L’hanno costretta”,

dicono gli altri. (ISST period005)

“She regretted having broken the silence” say some. “She’s been forced”,

say the others.

(94) …Shana, meglio ricordata per la pubblicità dove Ø dice: “Toglietemi tutto

ma non il mio Breil”… (ISST re028)



…Shana, better remembered for the commercial in which she says: “Take

everything away from me but my Breil”…

Another issue is determining the type when different types of cue co-exist. This is

different from multiple cues (95) (3.2.3), which should be analysed as separate

attribution relations. Relatively often a direct quotation occurs combined with a

verbal cue other than assertion, sometimes providing an interpretation of the

content (3.2.3). In the example (96) the quotes suggest that the content

corresponds to reported direct speech, therefore an assertion, while the cue

‘promise’ refers to an eventuality, of the kind expressing a commitment.

(95) The men can defeat immunities that states often assert in court by showing

that officials knew or should have known |that design of the structure was

defective| and |that they failed to make reasonable changes|. (PDTB 1160)


(96) […] (ISST cs030)


The strategy adopted here for these cases of composite cues of different types is

to give priority to the punctuation. A direct quote is surely the most reliable of the

attributions as the content is reported without any mediation. Moreover, the

assertion precedes the attitude expressed by the other cue as this was derived

from the semantics of the content. In (96) what the ‘Croatian generals’ said was

perceived as a promise, at least by the journalist reporting the information. By

establishing the predominance of punctuation, these instances would be classified

as ‘assertions’. Consequently manner verbs, with implicit general reportive verbs,

functioning as cues in combination with quotes (3.1.2), e.g. ‘sorridere’ (97) (to

smile/ to say while smiling), will also be classified as ‘assertions’ avoiding possible

confusion.

(97) Arlacchi sorride: “Pura paranoia politica. Non ho partecipato ai lavori solo a

causa di un impegno privato…”. (ISST re095)



Arlacchi smiles: “Pure political paranoia. I didn’t participate in the works

only because of a private appointment…” .
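
The priority rule for composite cues can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `resolve_type`, its parameters and the fallback value 'undecided' are hypothetical and not part of the schema, which only establishes that a direct quote leads to the value ‘assertion’.

```python
from typing import Iterable

ATTRIBUTION_TYPES = {"assertion", "belief", "fact", "eventuality"}


def resolve_type(cue_types: Iterable[str], has_direct_quote: bool) -> str:
    """Pick one 'type' value for a relation whose cues may suggest several."""
    types = set(cue_types)
    unknown = types - ATTRIBUTION_TYPES
    if unknown:
        raise ValueError(f"unknown type(s): {sorted(unknown)}")
    # Punctuation takes priority: a direct quote is classified as an assertion,
    # whatever attitude the accompanying verbal cue expresses.
    if has_direct_quote:
        return "assertion"
    if len(types) == 1:
        return types.pop()
    # No quotes and no single type: the text gives no rule for this case,
    # so this sketch leaves the decision open (hypothetical fallback value).
    return "undecided"
```

Under this sketch a quoted promise yields `resolve_type({"eventuality"}, True) == "assertion"`, and a manner verb such as ‘sorridere’ in (97), which carries no type of its own, can be passed as an empty collection together with the quote flag.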

A last issue is determined by the semantics of the verb cues. On the one hand,

the myriad of attribution verbs cannot always be unquestionably assigned to one

of the four possible types. While verbs like ‘dire’ (to say), ‘pensare’ (to think),

‘sapere’ (to know), ‘volere’ (to want), are quite prototypical and central to their

category, other verbs such as ‘criticare’ (to criticise), ‘avvertire’ (to warn), ‘leggere’

(to read), ‘elogiare’ (to praise), ‘suggerire’ (to suggest), are more peripheral and

uncertain. On the other hand, a conspicuous number of verbs are polysemous and

can belong to one or the other type according to which of their meanings is currently

in use. This can only be determined by the context as in ( (98), (99)). The same

verb cue ‘sostenere’ assumes in (98) an assertive function, corresponding to

‘claim’, while in (99) it expresses a commitment, meaning ‘support’, which

represents an ‘eventuality’.

(98) Il governo di Zagabria, invece, sostiene che sono “solo” 100 mila le

persone in cammino. (ISST cs031)

The Zagreb government claims instead that the people on the move are “only”

100 thousand.

(99) Ma ieri sera I parlamentari serbi hanno “sostenuto senza riserve” la

decisione di Karadzic. (ISST cs034)

However yesterday evening the Serbian parliamentarians have

“supported wholeheartedly” Karadzic’s decision.
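
The problem of polysemous cues can be made concrete with a small lookup table. The sketch below is a hypothetical and deliberately incomplete lexicon, built only from the verbs mentioned in this chapter; the names `CUE_TYPES`, `candidate_types` and `is_ambiguous` are assumptions introduced for illustration. Peripheral verbs such as ‘criticare’ are left out on purpose, since their type is uncertain.

```python
from typing import Dict, FrozenSet

# Verb lemmas and the types this chapter associates with them.
# Hypothetical, incomplete lexicon: not the cue list of the actual schema.
CUE_TYPES: Dict[str, FrozenSet[str]] = {
    "dire": frozenset({"assertion"}),
    "affermare": frozenset({"assertion"}),
    "spiegare": frozenset({"assertion"}),
    "pensare": frozenset({"belief"}),
    "credere": frozenset({"belief"}),
    "sapere": frozenset({"fact"}),
    "ricordare": frozenset({"fact"}),
    "promettere": frozenset({"eventuality"}),
    "proibire": frozenset({"eventuality"}),
    "volere": frozenset({"eventuality"}),
    # Polysemous: 'claim' in (98), 'support' in (99).
    "sostenere": frozenset({"assertion", "eventuality"}),
}


def candidate_types(lemma: str) -> FrozenSet[str]:
    """Return every type the lexicon allows for `lemma` (empty if unlisted)."""
    return CUE_TYPES.get(lemma, frozenset())


def is_ambiguous(lemma: str) -> bool:
    """True when the lemma alone does not determine the type."""
    return len(candidate_types(lemma)) > 1
```

A lookup of this kind can only narrow down the candidates: for ‘sostenere’ the choice between ‘assertion’ and ‘eventuality’ still has to be made from the context.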

The issues presented in this chapter partly arose from the pilot annotation, partly

from previous considerations and from the attempt to list and classify attribution

cues (6.3). Although the ‘type’ classification has been adopted unchanged for the

pilot in this study, the problems it raises strongly suggest testing its feasibility by

evaluating the inter-annotator agreement it yields and possibly introducing

some changes.



4.2 Source

The source is one of the key components of the attribution relation and as such it

is marked in the annotation. It can occupy any position, i.e. before, around, after or

in between, with respect to its content and can be expressed by a number of

elements (3.1.2). All the variation in their linguistic realisation aside, the entities the

sources refer to can be very different and this deeply affects the content, hence

the need of retrieving this relation. The annotation could therefore mark a basic

distinction of source types which would facilitate evaluating their reliability or

relevance. The source type has been included in the annotation schema and can

assume the same values as in the PDTB. These are: ‘writer’, ‘other’, and

‘arbitrary’.

Aikhenvald (2004:64) distinguishes between QUOTATIVE, that is reported

information having an overt reference to the source, ‘writer’ and ‘other’ are of this

kind, and HEARSAY, referring instead to reported information without an overt

reference to those it was reported by. The source of a hearsay takes the value

‘arbitrary’ in the annotation.

4.2.1 Writer

The writer is the default source of any journalistic text, and he or she holds the

shallowest level of attribution, the content being the entire news article. Relatively

often, at least in Italian newspapers, authors are not even explicitly mentioned, or

they are recalled by just their initials. Even when they are mentioned, writers are

never part of the article body as, similarly to any other attribution relation, the

source is usually not part of the content it holds but occupies an external or

peripheral position with respect to it.

Unlike in the PDTB, where discourse connectives and their arguments are

always attributed even without an explicit attribution relation, so that most of the

attributions are to the writer, attribution to the writer is left implicit here to simplify the

annotation process. The writer is external to the article intended as a discourse

unit and usually not the only external source involved. Apart from the writer of the

article, the newspaper publishing it could be considered another source and even

the website reporting it, in case of news published on the web. The attribution of

the entire news article to the writer and subsequently to the newspaper should be



easily inferable and can be added at a later stage if needed.

Nonetheless, in case the writer is directly and explicitly reporting his or her

opinion or words, the annotation should mark the writer as the source. By explicitly

mentioning himself, the writer presents information in a less factual way making

explicit that it is not shared knowledge but a personal point of view he is

presenting ( (100), (101)).

(100) È questo a mio parere il dato politico-sociale rilevante: … (ISST re085)

It is this in my opinion the relevant socio-political data: …

(101) Un arbitro corrotto caro Brera, è possibile che in tanti anni di calcio non sia

venuto fuori il nome di un arbitro corrotto? Io non ci credo. (ISST els027)

A corrupted referee dear Brera, is it possible that in many years of football

no name of a corrupted referee has come up? I don’t believe it.

4.2.2 Arbitrary

Sources which do not really attribute the content to a specific entity, or to an entity

having a real referent in the world, should all be marked as ‘arbitrary’. This category

includes impersonal sources such as ‘si’ (102)/ ‘uno’ (one), personal and

indefinite pronouns used as impersonals ‘tu’ (you), ‘qualcuno’ (someone) (103),

‘nessuno’ (no one), relative pronouns, e.g. ‘chi’ (who) (104), and missing sources,

like with verbal moods having no explicit subject, e.g. ‘infinito’ (infinitive), ‘gerundio’

(gerund), and passive constructions (3.1.2) with omitted agent.

(102) Spesso in questi casi si dice la mobilitazione popolare é più importante di

mille altre ricerche. (ISST els032)

Often in these cases one says that the popular intervention is more

important than thousands of other investigations.

(103) Qualcuno pensa che questo sia un quartiere privilegiato. (ISST cs092)

Someone thinks that this is a privileged district.



(104) C’è chi sostiene che stiamo vivendo il ritmo giusto di un mercato azionario

come quello italiano, considerato la sua dimensione e le sue strutture.

(ISST els055)

There are those who claim that we are experiencing the right pace for a share market like

the Italian one, considering its size and its structures.

Plural personal pronouns, i.e. ‘noi’ (we), ‘voi’ (you), ‘loro’ (they), indefinite pronouns,

e.g. ‘tutti’ (everyone), ‘molti’ (many) (105), and collective nouns, such as ‘la gente’

(the people) (106), can also be used as ‘arbitrary’ sources when referring to an

indistinct plurality. Especially with plural impersonals the effect achieved is often

that of attributing the content to everyone, as if this was some kind of general truth,

the expression of common sense or general knowledge (107).

(105) … <<domani questa stessa gente é pronta a scendere in piazza per

rivendicare>> dicono e scrivono in molti. (ISST els063)

…<<tomorrow the same people are ready to take to the streets to claim>>

many say and write.

(106) “…C’è gente che pensa siamo professionisti super pagati e invece la

situazione è molto diversa.” (ISST cs077)

“…There are people who think that we are super paid professionals, whereas

the situation is very different.”

(107) Tutti gli esseri umani sanno di poter essere più di ciò che sono. (ISST

cs012)

Every human being knows they can be more than what they are.

Indefinite pronouns however are not ‘arbitrary’ when their referent is restricted by a

specification (108) or they assume an adjectival role as in (109).

(108) Ma, con una reazione molto comune in casi del genere, nessuna delle

vittime ha pensato ... (ISST els072)



However, having a very common reaction in similar cases, none of the

victims thought …

(109) La decisione di convocarla fu presa domenica 23 dicembre, dopo che

alcuni ministri affermarono: <<sentiremo l’opinione dei familiari e

decideremo>>. (ISST els048)

The decision to convene it was made on Sunday, 23 December, after

some ministers affirmed: <<we will listen to the relatives’ opinion and

decide>>.

Another group of arbitrary sources is formed by those nouns referring to

‘containers’ or means of information, such as ‘voci’ (voices/rumors) (110),

‘resoconto’ (report), ‘indiscrezione’ (indiscretion) (111), ‘proverbio’ (proverb) etc…,

when the entity producing them is not expressed.

(110) In Italia si è fermi ai progetti e alle intenzioni, nonostante le voci che da

anni pronosticano l’avvento di Warner o Paramount nella gestione di sale.

(ISST sole036)

In Italy we are still at projects and intentions, despite the rumours that for

years have predicted the arrival of Warner or Paramount in the management of

movie theatres.

(111) Secondo indiscrezioni la prima segnalazione è stata inviata alla Procura

della Repubblica. (ISST cs015)

According to indiscretions the first report has been sent to the Public

Prosecutor’s office.

‘Arbitrary’ is a very informative attribute which allows distinguishing between

attributions to a real referent, which are labelled as ‘other’, and attributions whose

source is not really clear. Having this data marked, it is possible to decide whether

to include these attributions when considering the content or just leave them out

as if they were just a device the above source is employing to distance itself

from the content and from the responsibility deriving from being its direct source.



When information with a traceable source is sought, contents having an

‘arbitrary’ referent could be automatically discarded as they do not meet this

requirement and cannot be verified. On the other hand, when looking for general

truths, rumours about forecasts, moods concerning an event and so forth,

‘arbitrary’ sources are particularly relevant.
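
The two retrieval scenarios just described can be sketched as simple filters over annotated relations. The code below is only an illustration: the record `AnnotatedContent` and the functions `traceable` and `hearsay` are hypothetical names, and the two records are simplified from examples (111) and (79), with the proper noun in (79) labelled ‘other’ as a source with a real referent.

```python
from dataclasses import dataclass
from typing import List

# Values of the 'source' feature adopted from the PDTB.
SOURCE_TYPES = ("writer", "other", "arbitrary")


@dataclass
class AnnotatedContent:
    """Hypothetical flat record: source text, source type and content."""
    source_text: str
    source_type: str
    content: str


def traceable(relations: List[AnnotatedContent]) -> List[AnnotatedContent]:
    """Keep only contents whose source has an identifiable referent."""
    return [r for r in relations if r.source_type != "arbitrary"]


def hearsay(relations: List[AnnotatedContent]) -> List[AnnotatedContent]:
    """Keep only contents attributed to an 'arbitrary' source."""
    return [r for r in relations if r.source_type == "arbitrary"]


annotations = [
    # Simplified from example (111).
    AnnotatedContent(
        "indiscrezioni",
        "arbitrary",
        "la prima segnalazione è stata inviata alla Procura della Repubblica",
    ),
    # Simplified from example (79).
    AnnotatedContent("Kasim Zdionica", "other", "… Poi stasera torno a Zagabria"),
]
```

With these two records, `traceable(annotations)` retains only the content attributed to Kasim Zdionica, while `hearsay(annotations)` retains only the one attributed to ‘indiscrezioni’.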

4.2.3 Other

The value ‘other’ is associated with those sources which refer to a specific entity.

This is often a proper noun of a person, e.g. ‘Kasim Zdionica’ (112), ‘Angela

Merkel’, or an organisation, e.g. ‘the Parliament’, ‘The Times’, etc... The specific

referent can be mentioned also somewhere else in the article and recalled

(bridging anaphora) by a general noun or pronoun in the attribution relation as in

(113).

The borderline between ‘arbitrary’ and ‘other’, however, is far from being

sharp. Sources can be more or less generic and detectable. Common nouns

sometimes refer to an entity whose identity can be more or less easily

reconstructed, such as ‘the president’ when talking about a specific company, ‘the

judge’ referring to a precise trial, ‘Angelina Jolie’s husband’, and so forth, but other

times this term is too generic to really allow identifying its referent.

(112) “… Poi stasera torno a Zagabria”, grida Kasim Zdionica, un signore con

una pancia enorme, le ciabatte di gomma e un pugnale infilato nella

cintura. (ISST cs030)

“… Besides, this evening I’ll go back to Zagreb”, shouts Kasim Zdionica, a

man with a huge belly, plastic slippers and a dagger inserted in the

belt.

(113) La Fermenta, a sentire l'arabo, è organizzata in modo che oggi consegue

un utile pari al 35 per cento del fatturato. Questo il vero traguardo che dovrà

nel tempo raggiungere la Pierrel. Ma come? Con tagli di mano d'opera?

Nemmeno per sogno, dice El Sayed. (ISST els001)

Fermenta, according to the Arabian, is organised so that it earns at present

a profit of 35 per cent of the turnover. This is the real goal that in the long



distance Pierrel will have to achieve. But how? Cutting down on workforce?

No way, says El Sayed.

Although included in the present study as ‘other’, common nouns such as

‘residente’ (resident), ‘passante’ (passer-by), ‘donna/ signora’ (woman) (114),

‘esperti’ (experts), whilst referring to a specific referent in the real world, represent

general terms which do not allow any identification or characterisation of the

source. In the example (115) the journalist is trying to give a characterisation to

this unknown referent he is quoting by adding a detail about the way she was

dressed, i.e. ‘in grey’, as if this would make the lady recognisable. However, these

sources are not to be confused with ‘arbitrary’. ‘A lady in grey’, unless the writer is

lying, is not a generic entity, a plurality, a hearsay, but a specific human being in

the real world, as the man in (112).

It is desirable to provide the final annotation with an additional distinction

that can account for this type of source, introducing an additional value for the

‘source’ feature such as ‘common’.

(114) Una donna afferma di aver assistito all’uccisione a sangue freddo del

marito. (ISST re084)

A woman claims she has witnessed the cold-blooded killing of her husband.

(115) <<Voto no>> diceva una signora in grigio <<tanto c’è già chi ha deciso

per noi>>. (ISST els048)

<<I vote no>> a lady in grey was saying <<anyway there are already those who

have decided for us>>.

4.3 Factuality

Hunter et al. (2006) remark that many reportative verbs have, in addition to their

intensional use, an evidential use, such as the one making ‘B’ in (116) an

appropriate answer to ‘A’. They argue that theories of discourse interpretation

should account for these different uses. According to their analysis, the intensional

use is conceptually primary and the evidential use derives from it. They therefore



introduce two discourse relations in order to account for the different

interpretations: an evidence relation and an attribution relation. While evidence is a

subordinating relation veridical in both arguments, with attribution it is the embedded

clause that is subordinate to the main claim and it is non-veridical with respect to

the right argument.

(116) A: Why is John absent from the meeting?

B: Sharon said that he is out of town.

(Hunter et al., 2006:99)

When considering the factuality of an attribution relation, it should be clear to

which of these two uses it refers. The evidential relation is true depending on

the veracity of the evidence, which is the information in the content of the

attribution relation (116). The attribution is instead true if the actual relation source-

content via the attitude expressed by the cue is real, e.g. the assertive event in

(116) really took place. The factuality of an attribution relation, however, does not

entail the factuality of its content. Sources can in fact lie or just be wrong.

As the intensional use precedes the evidential use, the content of an

attribution relation can constitute evidence for something else only if the attribution

relation itself is factual. While it is very complex to account for the factuality of the

content, the factuality of the attribution relation can be syntactically computed

considering the cue and the source or other elements scoping over it. Some

information about the factuality of the content can be derived from the type of cue,

as suggested by Prasad, Miltsakaki et al. (2008:44). The feature ‘factuality’

accounts here for the factuality of the attribution relation only, marking

whether this relation really exists, i.e. answering the question: is this content really

presented as attributed to this source?

In their account of event factuality, Saurí and Pustejovsky (2008) distinguish

among situations presented as corresponding to real situations in the world,

situations which instead are unreal, and uncertain situations. They characterise

factuality as involving polarity and epistemic modality, which could be defined as

the commitment of a source towards the content of a proposition. Polarity takes

two values, i.e. positive and negative, while epistemic modality can assume a



range of values varying from absolutely certain to uncertain. The combination of

these two features determines a range of factuality values (Table 1).

            Positive    Negative
Certain     Fact        Counterfact
Probable    Probable    Not probable
Possible    Possible    Not certain

Table 1 - Factuality values (Saurí and Pustejovsky, 2008)

For the present annotation scheme, the factuality of the attribution relation can

assume only two values: ‘factual’ and ‘non-factual’. The first accounts for the

attribution relation being presented as a fact in the world (certain and positive).

‘Non-factual’ represents underspecified factuality and should not be confused with

counterfactual. It accounts in fact not only for attributions presented as not real,

but also includes the intermediate values expressing different degrees of

possibility and probability. Further distinctions of the ‘non-factual’ value of the

factuality attribute are left for future developments of the annotation schema.
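
The relation between the values in Table 1 and the two values adopted here can be sketched as a lookup followed by a collapse. The Python code below is a minimal illustration: the dictionary reproduces Table 1, while the names `FACTUALITY_VALUES` and `binary_factuality` and the lower-case labels are assumptions introduced for this sketch.

```python
from typing import Dict, Tuple

# Table 1: (epistemic modality, polarity) -> factuality value.
FACTUALITY_VALUES: Dict[Tuple[str, str], str] = {
    ("certain", "positive"): "fact",
    ("certain", "negative"): "counterfact",
    ("probable", "positive"): "probable",
    ("probable", "negative"): "not probable",
    ("possible", "positive"): "possible",
    ("possible", "negative"): "not certain",
}


def binary_factuality(modality: str, polarity: str) -> str:
    """Collapse a Table 1 value into the two values used in this schema."""
    value = FACTUALITY_VALUES[(modality, polarity)]
    # Only 'certain and positive' counts as factual; counterfacts and all
    # intermediate degrees fall under the underspecified 'non-factual'.
    return "factual" if value == "fact" else "non-factual"
```

Only one of the six cells maps to ‘factual’; the other five, including the counterfactual one, are grouped under ‘non-factual’, which is why the latter should be read as underspecified.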

4.3.1 Factual

The factuality of the attribution relation is marked in the PDTB with the

‘determinacy’ feature. This can take only two values: ‘indet’, accounting for the

attributions presented not as factual, and ‘null’ for the factual ones. This

substantially corresponds to the present account of factuality of attribution, with

‘null’ corresponding to ‘factual’ and ‘indet’ to ‘non-factual’. The term has been here

however changed to ‘factuality’ as this seems to be more specific and easily

recognisable.

In news language, factual attributions are by far the most frequent, as

journalists tend to report facts and to present information and events as real,

rather than merely making suppositions or hypotheses. An attribution presented as

factual may nonetheless not correspond to a real event. Whether or not to believe

the attribution relation is genuine can be decided only on the basis of the above

source, i.e. the source, or sources, of the content in which the attribution relation is

nested. In examples (117) and (118) the attributions are presented as factual by

the writer. In order to assess the veracity of the content it is instead

necessary to determine whether the source in (117), ‘Evtuscenko’, is mendacious

and whether the source in (118), ‘the public prosecutors’, really holds the attitude towards the

content or merely feigns it. This can be decided with the help of the context, but

common sense and extra-linguistic knowledge also contribute to the conclusion.

(117) Evtuscenko, nel suo articolo, afferma che Pasternak gli fece pervenire una

copia del romanzo poco dopo la sua prima pubblicazione. (ISST els076)

Evtuscenko, in his article, claims that Pasternak sent him a copy of the

novel shortly after its first publication.

(118) Monreale, i pm vogliono Cassisa alla sbarra. (ISST re124)

Monreale, the public prosecutors want Cassisa in the dock.

The analysis of the content factuality as well as of the source trustworthiness is

complex as it is not inferable from the syntax and grammatical features. The

factuality of the attribution itself is instead easily determined: the source should be

an entity and the cue should not be in the scope of a negation or an element

expressing uncertainty or probability.

4.3.2 Non-factual

Non-factual attributions can be considered negated attributions: they

express that there is no link between source and content or that this link is just

hypothetical. It could be argued that these instances do not hence represent

attribution relations and could be left out of the annotation. However, they

nonetheless convey relevant information and have been for this reason included in

the annotation. Non-attributions can, for example, correct false attributions or just

remark that there is no link between that particular content and a specific source

(119).

(119) John is under investigation. The police, however, haven’t said that he is

presumed guilty.


Non-factual attributions expressing possibility or probability, moreover, are very

useful when there is an interest in retrieving hypotheses or predictions and not just

facts. Modal verbs can be employed to express an attribution which is just

possible, desired, ordered or urged (120). However, this is not the case when the

source is in the first person as it then reflects more an idiomatic use as in (121).

While the source in (120) has never really asserted the content, the one in (121)

necessarily did.


distruzione di tutti gli armamenti nucleari.” (ISST cs039)


nuclear armaments.”

(121) No, Ø devo dire anzi che in queste prime due settimane il mondo sindacale

è stato in attesa e mi auguro che sia possibile intessere un dialogo forte.

(ISST sole011)

No, on the contrary (I) have to say that in these first two weeks the union

world has been lying in wait and I wish that it will be possible to intertwine a

strong dialogue.

Similarly, when the cue is in the scope of a conditional or part of a hypothetical

sentence (122), the attribution should be marked as non-factual. Other structures

or contexts making an attribution non-factual are: the imperative, usually with

verbs of belief and assertion such as ‘think’, ‘imagine’, but also ‘say’ and ‘admit’;

interrogative forms, as in (123); the future tense (124), (125), as an event

happening in the future is not yet a real event and it is not certain it will ever

become one; and the infinitive used to make a conjecture as in (126).

(122) Se Ø vuoi che il fast relax sia davvero efficace tieni d’occhio l’orologio e

scegli: l’intervallo di pranzo e il ritorno a casa. (ISST period003)

If (you) want that the ‘fast relax’ is really effective keep an eye on the watch

and choose: the lunch break and the homecoming.


(123) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo

scrittore si trovasse a una svolta esistenziale? (ISST els034)

Do you also think like many literary critics that, with his unfinished novel,

the writer was at an existential turning-point?

(124) E Ø diranno all’ONU che il problema dei profughi non li riguarda. (ISST

re084)

And (they) will tell the UN that the refugee problem does not concern them.

(125) E naturalmente molti diranno che ha usurpato il posto in finale. (ISST

els062)

And surely many will say that it has usurped its place in the final.

(126) It is silly libel on our teachers to think they would educate our children

better if only they got a few thousand dollars a year more. (PDTB 1286)

The presence of a grammatical cue, i.e. the quotative conditional (see 3.1.3), could

also be taken as a sign of uncertainty, as in example (127). However, although

often related to epistemic modality, and therefore involving some degree of

uncertainty, the quotative conditional is a sign of an additional level of attribution,

namely a level of nesting left implicit. In (127) what the quotative conditional

expresses is not uncertainty about the attribution. The uncertainty is a

consequence of the quotative conditional which, scoping on the cue, presents the

attribution relation as second-hand material, similarly to hearsay. Attributions

including a quotative conditional should therefore be considered factual.

(127) Manlio Averna avrebbe infatti riferito al pm che, in base agli accertamenti

finora effettuati, è molto improbabile che Castellari si sia sparato. (ISST

sole016)

Manlio Averna has told (QUOT.COND) the public prosecutor that,

according to the verifications done till now, it is very unlikely that Castellari

shot himself.


Apart from being connected with the cue, non-factual attributions are also found

when the source is negated as in (128). The attribution to no-source is not linking

the content to any entity and therefore is non-factual.

(128) Nessuno parla più di baratro imminente e di crisi finanziaria. (ISST cs025)

No one is talking anymore about imminent precipice and financial crisis.
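
The criteria listed in 4.3.1 and 4.3.2 can be summarised as a small decision function. The following Python sketch is only a restatement of the prose above under simplifying assumptions: the class `CueContext`, its field names and the function `factuality` are hypothetical, and the flags are assumed to be supplied by an annotator or by some pre-processing step that is not described here. Negation is treated as a plain trigger; cases in which it scopes over the content instead are discussed in 4.4.

```python
from dataclasses import dataclass


@dataclass
class CueContext:
    """Hypothetical flags describing the context of an attribution cue."""
    source_is_entity: bool = True         # False for 'nessuno' (no one), as in (128)
    negated: bool = False                 # cue in the scope of a negation, as in (119)
    modal: bool = False                   # cue governed by a modal verb, as in (120)
    first_person: bool = False            # first-person source, as in (121)
    hypothetical: bool = False            # conditional or hypothetical clause, as in (122)
    imperative: bool = False
    interrogative: bool = False           # as in (123)
    future: bool = False                  # as in (124), (125)
    conjectural_infinitive: bool = False  # as in (126)
    quotative_conditional: bool = False   # as in (127): does not change the value


def factuality(ctx: CueContext) -> str:
    """Return 'factual' or 'non-factual' for the attribution relation."""
    if not ctx.source_is_entity:
        return "non-factual"
    # A modal with a first-person source is treated as idiomatic, see (121).
    modal_trigger = ctx.modal and not ctx.first_person
    triggers = (ctx.negated, modal_trigger, ctx.hypothetical, ctx.imperative,
                ctx.interrogative, ctx.future, ctx.conjectural_infinitive)
    return "non-factual" if any(triggers) else "factual"


print(factuality(CueContext(quotative_conditional=True)))
print(factuality(CueContext(future=True)))
```

The quotative conditional is recorded but deliberately ignored by the function, since it signals an implicit level of nesting rather than uncertainty about the attribution.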

4.4 Scopal Change

It is not always the case that an attribution cue in the scope of a negation is non-factual.

It is possible, for example, that a negative particle affecting a verbal cue on

the surface instead reverses the polarity of the content. This feature is included in

the PDTB (Prasad, Miltsakaki et al., 2008:46) with the name of ‘scopal polarity’.

Annotating this feature is not essential in order to account for the attribution

relation; however, it is crucial for the interpretation of the content. The feature

takes two values: ‘scopal change’ and ‘none’. In case an attribution is factual, but

its cue is in the scope of a negation, presumably the negation is affecting the

content and not the relation itself. If it were possible to determine separately

the scope of negations and other elements, this would be preferable and the

‘scopal change’ feature would no longer be needed.

4.4.1 Scopal Polarity

Most commonly, the scopal change affects the polarity of the content. The surface

negation can be expressed syntactically (e.g. don’t say, don’t think), or lexically,

e.g. ‘negare’ (to deny), ‘escludere’ (to exclude), ‘smentire’ (to deny). Lexical

negations which are part of the verb semantics, as in the example (129) below, are

always scoping on the content of the relation. The relation between the ‘Croatian

government’ and the contents it holds is factual and it could be changed into:

‘…the Croatian government affirms that they have NOT been expelled and affirms

also TO HAVE NO ethnic cleansing intention in the newly conquered areas’ or

alternatively ‘…says it is not true that…’.

(129) Qualunque sia il numero di sfollati, il governo croato nega che siano stati

espulsi e nega anche qualsiasi volontà di pulizia etnica nelle regioni appena


riconquistate. (ISST cs031)

Whatever the number of evacuees, the Croatian government denies that

they have been expelled and denies also any ethnic cleansing intention in

the newly conquered areas.

In case of a double negation as in (130), containing a syntactic and a lexical

negation, the first scoping on the verb and the second on the content, the result is

again a positive, and therefore factual, attribution. ‘Not deny’ corresponds to

‘affirm’ and as the negation scoping on the verb is changing its semantics, its

reversed reading can no longer affect the polarity of the content. In these cases

the annotation should assign the feature ‘scopal change’ the value ‘none’.

(130) Ieri circa mille giovani hanno lasciato la città, ma la polizia non esclude che

possa esserci qualche altra esplosione di violenza. (ISST cs037)

Yesterday around a thousand young people left town, but the police

don’t exclude that there could be some other act of violence.
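
The interaction between syntactic and lexical negation described for (129) and (130) can be laid out as a small case table. The Python sketch below is only an illustration of the cases discussed so far: the function name `negation_reading` and its two boolean parameters are hypothetical, and the function assumes that no other element affecting factuality is present.

```python
from typing import Optional, Tuple


def negation_reading(syntactic: bool, lexical: bool) -> Optional[Tuple[str, str]]:
    """Return (factuality, scopal change) for a cue, or None when the
    decision has to be left to the annotator."""
    if lexical and not syntactic:
        # 'nega che ...' (129): the negation is part of the verb semantics
        # and scopes over the content.
        return ("factual", "scopal change")
    if lexical and syntactic:
        # 'non esclude che ...' (130): the reading is positive again.
        return ("factual", "none")
    if syntactic:
        # 'don't say', 'don't think': either a non-factual attribution or a
        # scopal change, to be judged in context.
        return None
    # No negation at all (and, by assumption, no other trigger).
    return ("factual", "none")


print(negation_reading(syntactic=False, lexical=True))
print(negation_reading(syntactic=True, lexical=True))
print(negation_reading(syntactic=True, lexical=False))
```

Returning `None` for the purely syntactic case reflects the fact that the surface form alone does not settle whether the relation or the content is negated.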

Scopal changes do not occur with the verbs of the ‘fact’ type (131) as noted by

Kiparsky and Kiparsky (1971). They can occur however with the other types of cue

and relatively often with ‘beliefs’. Determining whether an attribution is non-factual

or there is a change in the scope of the polarity is often problematic. In the

example below (132) the attribution relation contains a ‘no-entity’ source, ‘no one’,

and should therefore be non-factual. However, ‘no one would like that to happen in

their town’ could be also rewritten as ‘everyone would like that not to happen in

their town’, involving a change in the polarity from the source to the content. Are

these sentences equivalent? Probably not. The correspondence is especially

difficult with wills or intentions: not wanting something does not exactly correspond

to wanting the opposite.

(131) Ma lui si strapazza, lavora troppo, Ø non ha capito che deve stare più

attento. (ISST cs059)

But he tires himself out, he works too much, (he) hasn’t understood that he

has to take more care of himself.


(132) Strano destino, quello di Civitavecchia: finire spesso, troppo spesso, sulle

pagine dei giornali per eventi misteriosi, oppure per fatti che nessuno

vorrebbe accadessero nella sua città. (ISST cs090)

Strange destiny, that of Civitavecchia: ending up often, too often, in the

news because of mysterious events, or because of events that no one

would like to happen in their town.

Part of the problem derives from the fact that ‘beliefs’ and some ‘eventualities’ do

not refer to events like assertions. While negating an event makes it non-factual, a

negative belief or will does not cancel the attribution relation: a negative mental

state is still a mental state.

By including non-factual attributions in the annotation, the issue of determining

the presence of a ‘scopal change’ in order to account for the veracity of the content

is less crucial. Uncertain instances, those still involving an attribution, therefore not

completely non-factual, and not exactly attributing the negation of the content,

hence also not involving a real scopal change, could be annotated according to

two strategies.

One possible solution would be that of marking them as ‘non-factual’ since

the attribution of the content does not actually take place. In this case, the

‘factuality’ attribute would be restricted to the veracity of the attribution of the

unchanged content to the source. This ‘non-factual’ attribution could still suggest,

however, that the reverse of the content, or a different content is presupposed. In

case this solution is adopted, the content of a non-factual attribution should be

more carefully considered as it could still carry useful information.

On the other hand, the opposite strategy could be adopted and the

attribution marked as ‘factual’ but involving a scopal change. In this case it should

be clear that the change is not implying the exact reverse of the content polarity,

but just that the negation is not really scoping over the attribution relation itself,

and that the content or the attitude the source holds are affected by it. If this

solution is chosen, it should be clear that the ‘scopal change’ attribute does not

necessarily reverse the polarity of the content.

Since in some cases it is not possible to determine, despite the help of the


context, if the attribution relation itself is negated or just the attitude the source

holds, e.g. John doesn’t want to become president (he never expressed this

intention/ he expressed a negative intention towards becoming president), the first

strategy seems more appropriate. The annotators should be invited to decide

whether they perceive an existing attitude, positive or negative, the source holds

towards the content and in case this is not clear they should mark the attribution

as ‘non-factual’. By analysing the inter-annotator agreement, it will then be possible to

determine whether the issue of ‘scopal change’ requires further clarifications.

4.4.2 Other Elements Affecting the Factuality

‘Scopal polarity’ (PDTB annotation) has been in the present annotation project

labelled as ‘scopal change’ as polarity is not the only element affecting the

factuality, and not the only one which can change in scope and affect the content

of an attribution instead of the attribution itself. Other constructions, although

uncommon, can occur. For example, the cue could be in the scope of a condition

as in (133). However, the condition in the first clause does not mean this is

required for the attribution relation to be factual, namely for the belief event in

(133) to take place. The condition affects instead the content of the attribution and

it is part of the belief: ‘If there is a majority […] the legislature could continue’.

(133) Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare

seriamente una fase di riforme anche elettorali, Ø penso che la legislatura

possa utilmente proseguire. (ISST re075)

If there is a majority at the Parliament able to seriously face a phase of

reforms, also electoral, (I) think that the legislature could usefully continue.

It is possible that other elements or constructions manifest a change in scope,

although further investigations are necessary to detect which ones and how to

recognise them. This is especially difficult because of their infrequency. The

annotation could however allow detecting other changes in scope affecting the

content.


4.5 Summary

Apart from annotating the spans corresponding to the three components of the

attribution relation, i.e. ‘source’, ‘cue’, ‘content’, attributes should be included in the

annotation schema which carry relevant information affecting the relation itself or

the interpretation of its content. In this chapter, these features have been

presented and compared with the features included in the PDTB scheme, adopted

as a model for the present one.

One aspect to annotate is the ‘type’ of the cue (4.1), expressing the kind of

attitude the entity is holding towards the content: ‘assertion’, ‘fact’, ‘belief’ or

‘eventuality’. This feature provides information partially affecting the factuality of

the content and the values other features can assume, e.g. ‘facts’ do not support

any ‘scopal change’. The feature ‘type’, however, is often complex to determine as

this categorisation is partially ambiguous. Before applying it to the whole corpus,

this should be tested for inter-annotator agreement and, in case of poor score,

perfected by changing or reducing the values.

Another useful feature to be marked is the ‘source type’ (4.2). This allows a

basic distinction among: ‘writer’, ‘other’ and ‘arbitrary’. The first (4.2.1) can be

connected to information presented as the personal point of view of the writer.

‘Other’ (4.2.2) stands for a specific source corresponding to a real entity. The last

(4.2.3), ‘arbitrary’, should be used when referring to sources without a real or

certain referent, thus labelling e.g. general knowledge, hearsay and rumours.

‘Factuality’ (4.3) makes it possible to distinguish between real attributions,

corresponding to a real event or mental attitude in the world, and hypothetical or

unreal attribution events. The annotation of this feature enables keeping these

separate, without losing the information carried by ‘non-factual’ attributions.

Lastly, a change in the scope affecting the content is also annotated and

labelled as ‘scopal change’. This usually affects the polarity of the content although

on the surface it appears to involve the cue and to make the attribution non-factual.

Determining when it is correct to identify a scopal change is a problematic issue.

Despite the fact that a scopal change cannot occur with cues of the type ‘fact’, this

matter needs to be addressed in context with particular attention to discerning

between negations affecting the existence of the attitude the source holds and

negations reversing instead this attitude.
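
The four attributes summarised above, with their predefined values, can be written down as a small data model. The Python sketch below is only one possible encoding: the class names, the choice of enumerations and the default values (‘factual’ and ‘none’) are assumptions made for the example, while the value labels themselves are those introduced in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class CueType(Enum):
    ASSERTION = "assertion"
    FACT = "fact"
    BELIEF = "belief"
    EVENTUALITY = "eventuality"


class SourceType(Enum):
    WRITER = "writer"
    OTHER = "other"
    ARBITRARY = "arbitrary"


class Factuality(Enum):
    FACTUAL = "factual"
    NON_FACTUAL = "non-factual"


class ScopalChange(Enum):
    SCOPAL_CHANGE = "scopal change"
    NONE = "none"


@dataclass(frozen=True)
class AttributionFeatures:
    cue_type: CueType
    source_type: SourceType
    factuality: Factuality = Factuality.FACTUAL    # assumed default
    scopal_change: ScopalChange = ScopalChange.NONE  # assumed default

    def __post_init__(self) -> None:
        # Cues of type 'fact' do not support a scopal change.
        if (self.cue_type is CueType.FACT
                and self.scopal_change is ScopalChange.SCOPAL_CHANGE):
            raise ValueError("'fact' cues cannot carry a scopal change")


example = AttributionFeatures(CueType.BELIEF, SourceType.OTHER,
                              scopal_change=ScopalChange.SCOPAL_CHANGE)
print(example.cue_type.value, example.scopal_change.value)
```

Encoding the values as closed enumerations mirrors the idea that annotators select from predefined values rather than typing labels by hand.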

5 Performing a Pilot Annotation


Developing an annotation schema goes hand in hand with testing it on the corpus

that is going to be annotated. Applying the schema to the corpus makes it possible

to assess intuitions and solutions, and thus to make better-informed choices based on the

data and not only on theoretical considerations. Real language examples,

moreover, on the one hand reflect real language use, and thus contain few or even no

occurrences of some possible but uncommon features; on the other hand, they

represent a repository of special cases which do not match descriptions of general

occurrences and characteristics.

Designing an annotation schema follows a path similar to that of any design

process (Figure I): (1) a preliminary stage in which objectives and requirements

are defined; (2) a phase in which the problem is analysed; (3) a planning phase in

which possible solutions are presented and a subsequent (4) testing phase with

the development of a prototype. The latter leads to the identification of viable or

unfeasible solutions and to the discovery of new issues, which in turn lead to a new planning

phase; the process is iterated until a satisfactory solution is reached.

[Figure: design process in five numbered stages: 1 requirements, 2 analysis, 3 planning, 4 evaluation, 5 release]

Figure I - Design Process

In order to perform an annotation, a suitable tool is required. The selection of the

most appropriate one to employ for the pilot annotation is the result of a detailed

analysis of several available tools. To make such a decision, the

characteristics the tools should possess in order to match the requirements of the annotation schema

had to be identified, thus allowing the definition of the desired tool

specifications. These represent the basis for the development of

appropriate software specifically designed to perform the task of annotating

attribution relations.

In this chapter, the Italian corpus to which a layer for attribution will be

added is presented and a subsection of it is sampled to be employed in the pilot

annotation. Afterwards, several tools for performing annotation are compared in

the light of the specific requirements of the current annotation schema. Finally,

one of these tools is selected and configured before proceeding with the annotation

of a sample of the corpus, which leads to the identification of new issues and a

partial redesign of the annotation scheme.

5.1 Corpus

The present study originates in the framework of a project aiming at the addition of

a layer for discourse to the ISST corpus. It takes, however, a different perspective,

leaving for later the analysis of discourse relations in general and concentrating

instead on attribution, which is only partially a discourse phenomenon. The ISST

corpus employed for this study is the Italian Syntactic-Semantic Treebank,

developed between 1999 and 2001 in the frame of the SI-TAL project, a

collaboration of several Italian research and university institutions with the purpose

of developing a suite of resources and tools for Natural Language Processing

applications. For the pilot annotation a subcorpus of the ISST has been selected,

as described in the relevant section (5.1.2).

5.1.1 ISST Architecture

The ISST corpus (Montemagni et al., 2003) consists of 307,682 word tokens and

was built to reflect contemporary language use. It is formed by a collection of 484

newspaper and periodical articles published between 1985 and 1995. One section

of the corpus, about two thirds, represents general language use and contains

articles about different subjects from ‘Repubblica’ (identified in the examples as

‘re’), ‘Corriere della Sera’ (cs), and other newspapers (els) and periodicals (period).

The other section of the corpus, about 90,000 tokens, is instead specialised, as it

deals with the financial domain. Articles in this section are taken from a single

financial newspaper, ‘Il Sole 24 Ore’ (sole), and were all published in 1994.

The ISST has a five-level structure encoding orthographic, morpho-syntactic,

syntactic and semantic information. Only the financial section of the

corpus has been fully annotated with all five levels. The syntactic level is split into

two separate ones so as to separately account for the constituent and dependency

structures, thus providing an independent view of the same surface syntax as one

level does not presuppose the other.

The orthographic level (Figure J) contains the word tokens and information

about lower-case or capital letters and punctuation. A unique ID number is

assigned to each token.

<w id="w_001" case="cap"> Bruxelles </w>
<w id="w_002" case="low"> all' </w>
<w id="w_003" case="cap"> Italia </w>
<w id="w_004"> : </w>
<w id="w_005" case="low"> urgente </w>
<w id="w_006" case="low"> ridurre </w>
<w id="w_007" case="low"> il </w>
<w id="w_008" case="low"> deficit </w>
<w id="w_009"> . </w>

Figure J - ISST orthographic level (sole002)

The morpho-syntactic annotation (Figure K) includes the mark-up of POS, lemma,

number, person, gender, etc. Multi-word expressions are analysed as a whole,

e.g. ‘in_mezzo_a’ (between/among), while morphologically complex words, such

as cliticised verbs, are instead treated so as to account for their constituent parts,

e.g. impedendoci > impedire + ci (prevent us).

<mw id="mw_001" pos="SP" mfeats="NN" lemma="bruxelles" sfeats="NP"

href="sole.orth002#id(w_001)"> Bruxelles </mw>

<mw id="mw_002" pos="E" mfeats="FS" lemma="a" sfeats="PART"

href="sole.orth002#id(w_002)"> all' </mw>


<mw id="mw_003" pos="SP" mfeats="NN" lemma="italia" sfeats="NP"

href="sole.orth002#id(w_003)"> Italia </mw>

<mw id="mw_004" pos="PU" lemma=":" sfeats="DIRS"

href="sole.orth002#id(w_004)"> : </mw>

<mw id="mw_005" pos="A" mfeats="NS" lemma="urgente" sfeats="AG"

href="sole.orth002#id(w_005)"> urgente </mw>

<mw id="mw_006" pos="V" mfeats="F" lemma="ridurre" sfeats="VIT"

href="sole.orth002#id(w_006)"> ridurre </mw>

<mw id="mw_007" pos="RD" mfeats="MS" lemma="il" sfeats="ART"

href="sole.orth002#id(w_007)"> il </mw>

<mw id="mw_008" pos="S" mfeats="MS" lemma="deficit" sfeats="N"

href="sole.orth002#id(w_008)"> deficit </mw>

<mw id="mw_009" pos="PU" lemma="." sfeats="TIT"

href="sole.orth002#id(w_009)"> . </mw>

Figure K - ISST morpho-syntactic level (sole002)
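
The `href` attribute in Figure K links each morpho-syntactic unit back to a token of the orthographic level (Figure J). How such a pointer could be resolved is sketched below in Python. This is an illustrative sketch only: the wrapping `<s>` root element, the regular expression and the function name `resolve` are assumptions, since the full file structure of the ISST levels is not shown here.

```python
import re
import xml.etree.ElementTree as ET

# Fragments copied from Figures J and K, wrapped in a hypothetical <s> root
# so that each string is a well-formed XML document.
ORTH = ('<s><w id="w_001" case="cap"> Bruxelles </w>'
        '<w id="w_002" case="low"> all\' </w></s>')
MORPH = ('<s><mw id="mw_001" pos="SP" mfeats="NN" lemma="bruxelles" sfeats="NP" '
         'href="sole.orth002#id(w_001)"> Bruxelles </mw></s>')

# Pointers of the form 'sole.orth002#id(w_001)': file name, then token ID.
HREF = re.compile(r"^(?P<file>[^#]+)#id\((?P<id>[^)]+)\)$")


def resolve(mw, orth_root):
    """Return the orthographic <w> element that a <mw> element points to."""
    match = HREF.match(mw.get("href", ""))
    if match is None:
        return None
    word_id = match.group("id")
    return orth_root.find(f".//w[@id='{word_id}']")


orth_root = ET.fromstring(ORTH)
for mw in ET.fromstring(MORPH).iter("mw"):
    w = resolve(mw, orth_root)
    print(mw.get("lemma"), "->", w.text.strip(), w.get("case"))
```

The same pointer mechanism is what allows a further level of annotation to refer to tokens without duplicating the text.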

The ISST takes a distributed approach to syntax, keeping functional annotation

and constituent structure on two separate levels, which can however be combined

if required. This strategy represents a more suitable way (Montemagni et al., 2003)

of describing languages like Italian, whose free constituent order

and pro-drop property would otherwise require the insertion of a number of empty elements,

with a consequent loss of annotation transparency.

The annotation of constituency (Figure L) produces shallow tree structures.

It was performed with a Shallow Parser and then manually revised. The functional

annotation is word-based and includes relations such as dependency, coordination

and intra-sentential co-reference.

[F3 [SN Bruxelles [SP a [SN Italia SN] SP] SN] F3] [CP [SA urgente SA] [F [SV2

ridurre SV2] [COMPT [SN il deficit SN] COMPT] F] CP]

Figure L - ISST syntactic constituent level (sole002)

Lastly, the ISST presents a lexico-semantic level of annotation, assigning

semantic tags. These convey: the sense of each word, based on the ItalWordNet

(IWT) lexical resource; special uses, e.g. idiomatic, proper nouns, neologisms,

etc.; and additional comments of the annotators.

A tool has been especially developed for the task of annotating and

combining the 5 levels of annotation of the ISST: GesTALt. The tool also provides

a visual representation of the annotation, e.g. functional annotation makes use of

graphs, while constituent structure is visualised as a strip tree. This tool is

unfortunately not open-source and could not be tested or employed for the pilot

annotation in the present study. The ISST corpus is available in a number of

formats, i.e. text, XML and CoNLL.

5.1.2 Subcorpus Selection

As attribution is a very pervasive relation in journalistic language, since it is common in

newspaper articles to report opinions, statements and information that other people

have expressed, only a part of the corpus could be annotated for the present study.

Extending the annotation to the whole ISST represents a subsequent stage which

would require employing annotators and possibly the development of a specific

tool.

In order to test the feasibility and effectiveness of the annotation schema

that is the object of the present study, a pilot annotation was performed on a sample of the

ISST corpus. Since the financial section is the only one that already has all five levels

of annotation, the addition of a sixth level for discourse and attribution would be

better performed on this part of the corpus so as to have a complete resource.

However, in order to avoid interference deriving from the specificity of the

financial domain, the selection of articles for the pilot annotation has not been

drawn only from this part, corresponding to the articles from ‘Il Sole 24 Ore’.

The subcorpus has been designed in order to be balanced with respect to

the language contained in the ISST corpus as articles from every section are

represented. Table 2 reports the total number of articles in each section (first row)

and the number of articles from that section included in the sample (second row).

A total of 50 articles out of the 484 constituting the corpus have been annotated,

representing approximately a tenth of the ISST (roughly 30,000 tokens). The

phenomenon of attribution appeared to be well represented in this subsection,

which contains a wide range of occurrences of attribution relations.

                      Cs    Els   Period   Re    Sole
Articles in section   99    81    13       136   155
Articles in sample    10    9     2        14    15

Table 2 - N. of articles selected per section
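
The figures in Table 2 can be checked, and the sampling share of each section computed, with a few lines of Python. The counts are copied from the table; the variable names and the lower-case section keys are assumptions made for the example.

```python
# Articles per section and per sample, copied from Table 2.
SECTION_TOTALS = {"cs": 99, "els": 81, "period": 13, "re": 136, "sole": 155}
SAMPLE = {"cs": 10, "els": 9, "period": 2, "re": 14, "sole": 15}

total = sum(SECTION_TOTALS.values())
sampled = sum(SAMPLE.values())
print(f"{sampled} of {total} articles ({sampled / total:.1%})")

# Share of each section that was included in the pilot sample.
for section, n_total in SECTION_TOTALS.items():
    share = SAMPLE[section] / n_total
    print(f"{section:<7}{SAMPLE[section]:>3} / {n_total:<4}{share:.1%}")
```

Running the check confirms that the section counts add up to the 484 articles of the corpus and the sample counts to the 50 annotated articles.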

The subcorpus was obtained from a single file (Figure M), containing the whole

corpus in table format with each line corresponding to a new token, and each tab-

separated column to a different annotation feature. The first column refers to the

article ID, the second to the sentence number and the third to the word counter in

the respective article. The following columns add information about constituency, POS

and lemma, and the seventh contains the tokens.

Figure M - ISST table format

In order to reconstruct the articles, so as to have them available in the text format

the tool required, the table file was split into one file per article, containing the word

tokens only, separated by a single space. This was achieved by writing a few lines

of code in the scripting language Python. Subsequently, it was necessary to

correct some errors detected in the original file which led to an incorrect word order.

Moreover, some characters such as hyphens and angle brackets were

identified as responsible for a crash at the launch of the tool software. In this

case it was necessary to substitute the corresponding ASCII character codes for the

problematic characters.
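
The splitting step described above could look roughly as follows. This is a reconstruction for illustration and not the script actually used: the column positions follow the description of the table format (article ID in the first column, token in the seventh), whereas the UTF-8 encoding, the output file naming and the function names are assumptions. The correction of the word order and the character substitutions are not covered.

```python
import csv
from collections import defaultdict
from pathlib import Path

ARTICLE_COL = 0  # first column: article ID
TOKEN_COL = 6    # seventh column: word token


def read_articles(table_path):
    """Group the tokens of the table file by article, in file order."""
    articles = defaultdict(list)
    with open(table_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= TOKEN_COL:
                continue  # skip blank or malformed lines
            articles[row[ARTICLE_COL]].append(row[TOKEN_COL])
    return articles


def write_articles(articles, out_dir):
    """Write one plain-text file per article, tokens separated by a space."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for article_id, tokens in articles.items():
        (out / f"{article_id}.txt").write_text(" ".join(tokens), encoding="utf-8")
```

A call such as `write_articles(read_articles("isst.tab"), "articles")`, with a hypothetical input file name, would produce one text file per article ID.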

5.2 Tool Selection

A myriad of tools have been developed with the purpose of annotating natural language, though

finding an existing tool perfectly matching the requirements of a specific annotation project

is a search which in most cases is doomed to fail. One obstacle is

the availability of the tool: owing to the high costs involved in the

production of software, some tools are sold commercially. Among the many

open-source tools, developed mainly by research and university institutes and

made available for academic purposes in order to promote their use and share

resources, the great majority was developed in the frame of a specific project.

These tools do not support all the annotation requirements of another project and

their code is often difficult or impossible to change in order to adapt it to the new

task.

A last group of open-source tools supports a wider range of annotation

projects and a high level of customizability. These annotation tools were designed

not just for a specific project, but to be able to support the annotation of one or a

group of phenomena, e.g. anaphora relations, speech interactions, temporal

references, etc… However, it is unlikely that a tool generally developed for a

specific phenomenon succeeds in capturing all its possible aspects, as it might take

an approach grounded in a specific theory or miss aspects which another project

wishes to consider and include in the annotation.

In the case of attribution relations, a more relevant issue has to be added to those

mentioned above, which already make the identification of a suitable tool challenging:

there is no tool especially designed to support the annotation of attribution.


5.2.1 Requirements

In order to find the best matching available tool it is necessary to first define what it

should match in order to support the annotation scheme, i.e. the annotation

requirements. First of all, the tool should be able to take advantage of the other

layers of annotation already available for the ISST corpus, especially to facilitate

the annotators’ task of retrieving possible attribution relations throughout the corpus.

For this reason, the tool should be able to read in a file like the table format (Figure

M) containing information from other layers of annotation. Only the bare text

should be displayed in order to avoid confusion; however, the tool should possess

a search function capable of retrieving e.g. the lemma of a given verb that could

be associated with attribution such as ‘say’, ‘think’ or ‘order’ or the POS of a token

in order to disambiguate between e.g. a verb and an adjective with words like

‘ordinato’ (ordered/tidy). This would support proceeding cue by cue to annotate

attribution, a strategy also adopted by the PDTB (Prasad, Miltsakaki et al., 2008) for

the annotation of discourse connectives.
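
The kind of search function required here can be illustrated with a minimal Python sketch. It is not the search facility of any of the tools considered: the `Token` structure, the function name `find_candidates` and the toy data are invented for the example, while the POS tags ‘V’ (verb) and ‘A’ (adjective) are those visible in Figure K.

```python
from typing import Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    index: int   # position of the token in the article
    form: str
    lemma: str
    pos: str


def find_candidates(tokens: List[Token], lemma: str,
                    pos: Optional[str] = None) -> Iterator[Token]:
    """Yield tokens with the given lemma, optionally restricted by POS."""
    for token in tokens:
        if token.lemma == lemma and (pos is None or token.pos == pos):
            yield token


# Invented toy data: the same form 'ordinato' as a verb and as an adjective.
tokens = [
    Token(1, "ha", "avere", "V"),
    Token(2, "ordinato", "ordinare", "V"),
    Token(3, "ordinato", "ordinato", "A"),
]
for hit in find_candidates(tokens, "ordinare", pos="V"):
    print(hit.index, hit.form)
```

Searching on the lemma and POS rather than on the surface form is what allows only the verbal occurrence, a potential attribution cue, to be retrieved.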

Once a cue is identified it should be possible to select it and mark the

existence of an attribution relation at that point of the text. This should be done on

the cue as it represents the only constituent of attribution which is always

expressed and singularly considered (in case of multiple cues separate relations

are annotated). The relation should require the selection of one or multiple text

spans for the content and the optional selection, as the source might be left

implicit, of one or multiple spans corresponding to the source. Each element

constituting a single source, cue or content will from now on be called “markable”

(Mueller and Strube, 2001:48). As source, cue and content (134) might be

fragmented and separated by intervening material, it should also be possible to

select discontinuous text spans as a single markable.

(134) <<La responsabilità è politica – aveva aggiunto il Procuratore capo- ed è il

potere politico che deve far funzionare i servizi>>. (ISST els046)

<<The responsibility is political– had added the Chief Prosecutor– and it is

the political power that has to make services work>>.



Moreover, overlapping text spans should also be selectable as it is often the case

that attribution relations are nested within each other (see 3.2.1). It should therefore

be possible to associate each selected markable with the features it possesses,

through the selection of predefined values, thus speeding up the annotation

process and avoiding spelling errors the annotators could make when manually

writing these values. Finally, in case this is not automatically done when adding an

attribution relation, the tool should support linking two or more markables to

establish relations in both directions.

Lastly, concerning the output, the tool should save the annotation as stand-off,

in a separate file for each article, identified by the same index as the files

containing the other levels of annotation for the same article (i.e. cs.morph001,

cs.orth001, etc., cs.attr001). In-line annotation, which consists of adding XML tags to

the original text as in example (135), cannot represent overlapping markables,

because of XML syntax, and therefore is not suitable for describing attribution

relations.

The annotation should preferably refer to the word index (136), thus establishing a

unique pointer to each token in the corpus, corresponding to a line in the table

format (Figure M), and not to the byte offset, as e.g. white spaces and multi-word units

could cause a mismatch between the byte offsets in the original files and those

the tool refers to. Although possible, converting byte references into word

index references can introduce additional errors and is therefore dispreferred.

(135) <content>“In città non abbiamo uno scippo”</content>, <cue>ha

dichiarato</cue> <source>il sindaco</source>. (ISST re040)

“In town we do not have a single bag-snatching”, declared the mayor.

(136) <markable id="1" span="token_001…token_008" role="content">

<markable id="1" span="token_010…token_011" role="cue">

<markable id="1" span="token_012…token_013" role="source">
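
The difference between (135) and (136) can be made concrete with a small Python sketch. The token segmentation of the sentence and the helper names are assumptions made for illustration; the point is only that markables referring to word indexes are independent records and may therefore overlap freely, which in-line tags cannot do.

```python
# Tokens of example (135), indexed from 1 (assumed segmentation).
TOKENS = ["“", "In", "città", "non", "abbiamo", "uno", "scippo", "”", ",",
          "ha", "dichiarato", "il", "sindaco", "."]

# Stand-off markables as in (136): (role, first word index, last word index),
# both indexes 1-based and inclusive.
MARKABLES = [
    ("content", 1, 8),
    ("cue", 10, 11),
    ("source", 12, 13),
]


def text_of(first, last, tokens=TOKENS):
    """Recover the text of a markable from its word index span."""
    return " ".join(tokens[first - 1:last])


def overlap(a, b):
    """True if two markables share at least one token."""
    return a[1] <= b[2] and b[1] <= a[2]


# A hypothetical further markable inside the content: no conflict arises,
# because each record only points to the tokens it covers.
EXTRA = ("content", 3, 5)
```

Here `text_of(10, 11)` returns ‘ha dichiarato’, and `overlap(MARKABLES[0], EXTRA)` is true without either record having to be modified.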



5.2.2 Comparison of Available Tools

In order to select the most appropriate software to employ to perform the pilot

annotation of attribution relations, features of different available tools have been

compared in the light of the requirements listed above (5.2.1). Only open-source

tools have been taken into consideration. The analysis that follows is not intended

to provide a full account of every tool described but just to highlight positive and

negative aspects with respect to the present annotation project. While the most

promising tools have been tested by setting up a sample annotation schema and

performing the annotation of a single file, tools which appeared to be incompatible

with the most important requirements were soon dismissed and not further

investigated, together with those tools potentially meeting these requirements but

practically requiring complex modifications to their code.

A selection of possibly suitable tools has been drawn from surveys available

on the internet, such as David Lee’s Corpus-based Linguistics LINKS

(http://personal.cityu.edu.hk/~davidlee/devotedtocorpora/CBLLinks.htm) and

considering the tools adopted by similar annotation projects.

Since there is no tool specifically developed for the annotation of attribution,

general annotation tools or tools for the annotation of anaphora or discourse,

phenomena relatively similar or overlapping with attribution and therefore also

likely to require a similar description, have been considered. A brief analysis of the

main tools taken into account is reported below.

GATE

GATE (Cunningham et al., 2002), General Architecture for Text Engineering, is a

very complete architecture (freely available to download from http://gate.ac.uk/)

allowing the development of language processing software. The tool supports a

variety of formats, such as XML, RTF, HTML, plain text, although only the latter

was easily accepted and used for the sample annotation. A set of NLP resources

is provided with the tool, including a POS tagger, a semantic tagger and a

coreferencer. Setting up an annotation schema was a relatively easy task which could

be performed in a few minutes.



Figure N - GATE annotation environment

The tool supports nested annotation, as the same portion of text can be selected

several times; however, the annotation of discontinuous spans is not possible, as

non-adjacent spans cannot be included in the same markable. It also does not seem

to be possible to establish relations between markables. Moreover, the

annotation itself is quite problematic, as the selection of text spans, their

deletion or modification, and the addition of features are not intuitive. GATE, which

was used for example for the annotation of the MPQA Opinion Corpus (Wiebe et

al., 2005), also includes a query tool. The annotation is stored in XML format with

reference to the byte, as in the example below (Figure O).

<?xml version='1.0' encoding='windows-1252'?>
<GateDocument>
  <GateDocumentFeatures>
    <Feature>
      <Name className="java.lang.String">gate.SourceURL</Name>
      <Value className="java.lang.String">file:/C:/Documents%20and%20Settings/Prova1.txt</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">MimeType</Name>
      <Value className="java.lang.String">text/plain</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">docNewLineType</Name>
      <Value className="java.lang.String">CRLF</Value>
    </Feature>
  </GateDocumentFeatures>
  <AnnotationSet>
    <Annotation Id="8" Type="Source" StartNode="2736" EndNode="2746">
      <Feature>
        <Name className="java.lang.String">Type</Name>
        <Value className="java.lang.String">Arbitrary</Value>
      </Feature>
    </Annotation>
    <Annotation Id="9" Type="cue" StartNode="2747" EndNode="2771">
      <Feature>
        <Name className="java.lang.String">Factuality</Name>
        <Value className="java.lang.String">Non-factual</Value>
      </Feature>
      <Feature>
        <Name className="java.lang.String">Scopal change</Name>
        <Value className="java.lang.String">None</Value>
      </Feature>
      <Feature>
        <Name className="java.lang.String">Type</Name>
        <Value className="java.lang.String">Fact</Value>
      </Feature>
    </Annotation>
    <Annotation Id="10" Type="content" StartNode="2772" EndNode="2864">
    </Annotation>
  </AnnotationSet>
  <AnnotationSet Name="Original markups">
    <Annotation Id="0" Type="paragraph" StartNode="0" EndNode="2887">
    </Annotation>
  </AnnotationSet>
</GateDocument>

Figure O - GATE annotation exported in XML
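
An export of this kind can be read back with a few lines of Python, which also makes visible that the pointers are offsets into the text rather than word indexes. The sketch below uses only the standard `xml.etree.ElementTree` module on a shortened copy of the export in Figure O; the function name and the returned dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the export shown in Figure O.
GATE_XML = """<GateDocument>
<AnnotationSet>
<Annotation Id="8" Type="Source" StartNode="2736" EndNode="2746">
<Feature>
<Name className="java.lang.String">Type</Name>
<Value className="java.lang.String">Arbitrary</Value>
</Feature>
</Annotation>
<Annotation Id="9" Type="cue" StartNode="2747" EndNode="2771">
<Feature>
<Name className="java.lang.String">Factuality</Name>
<Value className="java.lang.String">Non-factual</Value>
</Feature>
</Annotation>
</AnnotationSet>
</GateDocument>"""


def read_annotations(xml_text):
    """Collect type, offsets and features of every Annotation element."""
    root = ET.fromstring(xml_text)
    result = []
    for ann in root.iter("Annotation"):
        features = {f.findtext("Name"): f.findtext("Value")
                    for f in ann.findall("Feature")}
        result.append({
            "id": int(ann.get("Id")),
            "type": ann.get("Type"),
            "start": int(ann.get("StartNode")),
            "end": int(ann.get("EndNode")),
            "features": features,
        })
    return result


annotations = read_annotations(GATE_XML)
```

Each record carries only a start and an end offset, so neither a discontinuous span nor a link between source, cue and content can be read off the file.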

Knowtator

Meant to serve a wide range of annotation purposes, Knowtator (Ogren, 2006) is a

plug-in of the knowledge representation system Protégé (both freely downloadable

from http://knowtator.sourceforge.net/) which allows the definition of annotation

schemas. Setting up an annotation schema is not particularly complicated; however,

it is not possible to set pre-defined values for the attributes to choose from, but only

one default element. This means that values have to be typed in manually by the

annotators, which represents an additional difficulty and a further source of

errors.

The tool, however, supports establishing a relation between markables as

well as multiple selections. A multiple slot for instances of source or cue could be

for example inserted in the cue class as in Figure P (right hand side in the middle).

This shows a sample annotation project consisting of a single file and of a single

attribution relation. The file containing the annotation is presented in Figure Q.

Nested and discontinuous selections are also supported. A search function is,

however, not available. Another drawback is that, although relatively easy to set up,

the tool is quite complicated to use and requires some training, as markable

selection and the addition of features rely on icon buttons that are

not very user-friendly.

Figure P - Knowtator annotation environment

A collection of texts can be defined for a project. These should be plain text;

however, XML and database table formats should also be supported. The tool

provides stand-off annotation with reference to the byte (Figure Q). The output is

relatively redundant as every annotation and feature is saved as a separate



annotation instance with explicit mention of the annotator and creation date.

<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="01.txt">
  <annotation>
    <mention id="Attributionprova_Instance_20000" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="439" end="494" />
    <spannedText>Il presidente della Banca Centrale, Jean-Claude Trichet</spannedText>
    <creationDate>Sun Aug 16 18:24:30 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20003" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="496" end="506" />
    <spannedText>ha parlato</spannedText>
    <creationDate>Sun Aug 16 18:24:49 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20007" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="511" end="544" />
    <spannedText>grave rallentamento dell’economia</spannedText>
    <creationDate>Sun Aug 16 18:25:24 CEST 2009</creationDate>
  </annotation>
  <classMention id="Attributionprova_Instance_20007">
    <mentionClass id="Content">Content</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20000">
    <mentionClass id="Source">Source</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20003">
    <mentionClass id="Cue">Cue</mentionClass>
    <hasSlotMention id="Attributionprova_Instance_20009" />
    <hasSlotMention id="Attributionprova_Instance_20010" />
  </classMention>
  <stringSlotMention id="Attributionprova_Instance_20009">
    <mentionSlot id="type" />
    <stringSlotMentionValue value="Assertion" />
  </stringSlotMention>
  <complexSlotMention id="Attributionprova_Instance_20010">
    <mentionSlot id="Attribution_source" />
    <complexSlotMentionValue value="Attributionprova_Instance_20000" />
    <complexSlotMentionValue value="Attributionprova_Instance_20007" />
  </complexSlotMention>
</annotations>

Figure Q - Knowtator annotation exported in XML
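
The redundancy of this output becomes apparent when trying to recover a simple list of roles and spans: the span of a markable and its class are stored in separate elements that have to be joined through the mention id. A minimal Python sketch of this join is given below, on a shortened copy of Figure Q; the function name is hypothetical and, for simplicity, only the first `span` element of each annotation is read.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the export shown in Figure Q.
KNOWTATOR_XML = """<annotations textSource="01.txt">
<annotation>
<mention id="Attributionprova_Instance_20000" />
<span start="439" end="494" />
<spannedText>Il presidente della Banca Centrale, Jean-Claude Trichet</spannedText>
</annotation>
<annotation>
<mention id="Attributionprova_Instance_20003" />
<span start="496" end="506" />
<spannedText>ha parlato</spannedText>
</annotation>
<classMention id="Attributionprova_Instance_20000">
<mentionClass id="Source">Source</mentionClass>
</classMention>
<classMention id="Attributionprova_Instance_20003">
<mentionClass id="Cue">Cue</mentionClass>
</classMention>
</annotations>"""


def roles_by_span(xml_text):
    """Join each annotation with its class through the shared mention id."""
    root = ET.fromstring(xml_text)
    classes = {cm.get("id"): cm.find("mentionClass").get("id")
               for cm in root.findall("classMention")}
    rows = []
    for ann in root.findall("annotation"):
        mention_id = ann.find("mention").get("id")
        span = ann.find("span")  # first span only (simplification)
        rows.append((classes.get(mention_id),
                     int(span.get("start")),
                     int(span.get("end")),
                     ann.findtext("spannedText")))
    return rows


rows = roles_by_span(KNOWTATOR_XML)
```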

Callisto

The annotation tool Callisto (open-source, available from http://callisto.mitre.org/)



was adopted for a part of the annotation of temporal relations in the frame of

developing the ITB, Italian TimeBank (Caselli et al., 2008), on a portion of the ISST

corpus. The tool has a very neat and basic interface, which nonetheless allows

user preferences to be set. Overlapping text spans can be selected, as well as single

characters, by changing the annotation from ‘word’ to ‘character swiping’.

Some annotation schemas, e.g. POS or coreference, are already available; to

create a new ‘task’, it is necessary to define a DTD. However, this option

does not seem to work easily, and therefore the tool was not set up for the annotation of

attribution on a sample article. This was not necessary in any case, since the tool does not

meet some important requirements: it does not seem possible to select

discontinuous spans as a single markable or to establish relations between

markables. Callisto annotation is saved as stand-off with reference to the byte;

however, the conversion into word index reference is supported.
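
What such a conversion involves can be illustrated with a short Python sketch. It is not the converter provided by Callisto: the function names are hypothetical, and the sketch works on character offsets, which coincide with byte offsets only in single-byte encodings. It also assumes that the token list matches the text exactly, which is precisely the condition that white spaces and multi-word units can violate.

```python
def token_offsets(text, tokens):
    """Locate each token in the text, returning (start, end) character offsets."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token is missing
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def to_word_span(start, end, offsets):
    """Return the 1-based (first, last) indexes of the tokens overlapping [start, end)."""
    hits = [i for i, (s, e) in enumerate(offsets, start=1) if s < end and e > start]
    if not hits:
        raise ValueError("no token overlaps the given offsets")
    return hits[0], hits[-1]


TEXT = "Il sindaco  ha dichiarato"  # note the double space before 'ha'
TOKENS = ["Il", "sindaco", "ha", "dichiarato"]
OFFSETS = token_offsets(TEXT, TOKENS)
```

With these data, the offsets 12 to 25 are mapped to the word span (3, 4). The double space shifts every following offset by one, while the word indexes are unaffected, which is why a word index reference is preferable in the first place.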

MMAX2

Written in Java, MMAX2 (Mueller and Strube, 2006) is a general purpose tool

(available open-source from http://mmax2.sourceforge.net/) with a special focus

on the annotation of anaphoric/ coreferential expressions, word sense

disambiguation and POS tagging. Starting a project requires some time as the

annotation schema has to be externally specified prior to launching the program.

Nonetheless MMAX2 is a very flexible instrument that allows personalising the

display of the annotation tool using XSL Style Sheets.

The tool requires text input files, however XML support is under

development and should be available shortly. The tool can be set so as to guide

the annotation presenting default and pre-defined values to choose from for the

attributes. MMAX2 allows the selection of overlapping and discontinuous text

spans as well as the possibility to link markables together using relations.

The stand-off annotation provided by the tool points to the word index and

not to the byte, as most other tools do. Every markable level is saved in a different

XML file, in which each markable is assigned an ID, the pointer to the text span

and any other feature or relation associated with it. The result is a very compact

and easy-to-read annotation. The tool was employed, among others, for the

annotation of anaphora and deixis in the VENEX corpus (Poesio et al., 2009).



Annotator

Annotator is the tool specifically developed for the annotation of discourse

connectives and their arguments in the PDTB (Prasad, Miltsakaki et al., 2008). The

tool supports the annotation of attribution on ‘raw text’ files according to the

schema adopted by the PDTB (see 2.4.3). The interface is very user-friendly and

guides the annotation by listing the possible values from which to select and by

applying constraints. Unfortunately, however, the tool could not be adapted to the

present annotation schema, as Annotator does not support the setting of different

markables or features. The tool could be adapted by changing the source code;

however, this was not available, and doing so would in any case be a time-consuming task

similar to writing a completely new annotation tool.

Annotator was not designed to account for attribution relations not occurring

in correspondence with the discourse connective structure. In addition, nested

attributions are not contemplated and it is not possible to specify the role of each

of the three elements constituting an attribution (i.e. source, cue, content) or to

establish relations between them. The tool produces stand-off annotation with

reference to the byte. Even though the tool represents a good example of how an

annotation tool for attribution could be designed and implemented, since it

was not possible to adapt Annotator to the annotation schema developed in this

study, it could not be considered a possible candidate for the pilot annotation.

Other tools

Among other tools, NITE and EXMARaLDA were also briefly taken into

consideration. NITE XML Toolkit (open-source at:

http://sourceforge.net/projects/nite/files/) is a very powerful instrument aimed at

software developers which allows building specialised annotation schemas and

interfaces for a wide range of purposes. It is especially intended to support

multimedia language data and it has been employed in a number of meeting and

dialogue corpora. NITE, however, is quite complex to set up and a sample

annotation project could not be developed to test it.

EXMARaLDA (Schmidt, 2001), Extensible Mark-up Language for Discourse

Annotation (available from: http://exmaralda.org/), is a system of Java-based tools

with XML data formats especially designed for the annotation and assisted



transcription of spoken language. The tool is not suitable for the annotation of

attribution relations as it is not meant to relate markables and it does not support

the definition of an annotation schema as would be required for attribution.

5.2.3 Selection and Tool Specifics

Concerning the requirements specified in (5.2.1) priority was given first to those

features enabling the selection of the text spans involved in attribution, i.e.

discontinuous and nested markables, together with the possibility of establishing

relations among cue, source and content markables, including the possibility of

having more than one source and/or content per relation. Subsequently, the tool's

customizability and user-friendliness were also considered, with particular

attention to the possibility of setting guided choices for the markable features. Part

of this second group of requirements was also the tool annotation format, ideally

neat and compact stand-off XML annotation with reference to the word index.

Other aspects, such as the possibility of querying the corpus with reference to

other levels of annotation in order to retrieve possible cues or the support of input

data in a format other than text, were temporarily left aside.

Of the tools considered above (5.2.2), only two, Knowtator and MMAX2,

appeared to meet the first set of requirements and were therefore more closely

compared to check other relevant characteristics.

Supported features             Knowtator           MMAX2

Discontinuous text selection   Yes                 Yes
Nested selection               Yes                 Yes
Relations                      Yes                 Yes
Multiple sources/contents      Yes                 Yes

Pre-defined values selection   No (one default)    Yes (menus)
Display customizability        Yes (partial)       Yes (complete)
Ease of setting a scheme       Simple (internal)   Medium (external)
Ease of annotation             Medium              Simple
XML stand-off output           Yes                 Yes
Reference to word index        No (byte)           Yes

Table 3 - Knowtator/ MMAX2 feature comparison



Knowtator and MMAX2 differ in some aspects concerning the second group of

requirements. These are listed in the lower half of Table 3. Knowtator is certainly

easier to set up, as the annotation schema and customization can be defined

internally through the interface. MMAX2 instead requires editing external XML and XSL files,

both to set up the annotation schema and to customize the interface and display of

the annotation.

On the other hand MMAX2 can be more personalised and the annotation

scheme better specified so as to have pre-defined values to select from, thus

facilitating the annotation by reducing the annotators’ cognitive load. The

annotation itself, i.e. selection, deletion, extension of a text span, is also easier.

Lastly (last row in Table 3), MMAX2 saves the annotation as stand-off with

reference to the word index, whereas Knowtator refers to the byte.

Considering all their characteristics, the higher set-up costs of MMAX2

seem to be well compensated by a subsequently more structured annotation and a

more flexible interface. This, together with the possibility of anchoring the markable

spans to the original text through references to the word indexes, made this tool

prevail as the most suitable for the present purpose of annotating attribution

relations according to the proposed schema.

5.3 Setting MMAX2

Installing MMAX2 is easy, though it requires a current Java version installed on the

machine to run. Once the program is launched, it is possible to start a project

using the Project Wizard shown in Figure R. In this window the ‘raw text’ input file

that will be used for the annotation has to be selected. This file is then analysed

by the program and tokenised. The articles from the ISST had previously been

tokenised and corrected; it was therefore not necessary to do this again, since in this

case it is possible to tick the ‘Input file is one token per line’ box.

Afterwards it is required to specify at least one markable level for the

annotation and, optionally, some display preferences related to it. The last section

in the window (Figure R) contains the paths to where the different project

components are stored and allows selecting a name for the project and the stored

input file.



Figure R - MMAX2 Project Wizard

Each MMAX2 annotation project has five different components (and a

common_paths file specifying where these are stored):

-the Base Data, that is the data on which the annotation is performed. For the

present project this consists of one XML file per article, derived

from the ‘raw text’ files provided to the tool as input. The file has one

token per line, to which a progressive word index is assigned

(Figure S).

<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
<word id="word_1">LONDRA</word>
<word id="word_2">.</word>
<word id="word_3">Gas</word>
<word id="word_4">dalla</word>
<word id="word_5">statua</word>
<word id="word_6">Evacuata</word>
<word id="word_7">la</word>
<word id="word_8">Tate</word>
<word id="word_9">Gallery</word>
<word id="word_10">.</word>
<word id="word_n">…</word>
</words>

Figure S - MMAX2 Base Data (ISST cs001)
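
The derivation of such a file from a one-token-per-line input can be sketched in Python as follows. This is an illustration of the format in Figure S and not the code used by the MMAX2 Project Wizard; the function name is hypothetical, and the UTF-8 encoding declaration is an assumption chosen so that accented Italian characters can be stored directly.

```python
from xml.sax.saxutils import escape


def build_base_data(token_lines):
    """Return the content of a words file for a one-token-per-line input."""
    # Skip empty lines so that the word index stays progressive.
    tokens = [line.strip() for line in token_lines if line.strip()]
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<!DOCTYPE words SYSTEM "words.dtd">',
             "<words>"]
    for i, tok in enumerate(tokens, start=1):
        # escape() protects characters such as '&' and '<' inside a token.
        lines.append(f'<word id="word_{i}">{escape(tok)}</word>')
    lines.append("</words>")
    return "\n".join(lines)


base_data = build_base_data(["LONDRA\n", ".\n", "Gas\n", "\n", "R&B\n"])
```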

-the Scheme, an XML file for each markable level containing the annotation

schema. This file specifies markable attributes and relations.

Attributes represent descriptive information, while relations account

for structural or associative information. Relations in MMAX2 can

be of two kinds: ‘markable-set’, undirected relations between two or

more markables, and ‘markable-pointer’, a directed relation from

one markable to one or more target markables. Attributes can be

simple FREETEXT, thus accepting any string as their value,

NOMINAL_LIST, a pre-defined closed set of possible values

presented as a drop-down menu, or NOMINAL_BUTTON, similar

to the former except that the values are presented as a sequence of

radio buttons. In the Scheme file it is not only possible to set the

type of attributes, with their pre-defined values, and relations, but

also to determine a hierarchy of attributes. Dependencies can be

expressed by adding a ‘next’ specification to an attribute value, indicating

which other attribute or set of attributes to enable when that value

is selected.

-the Style, an XSL file which defines the display. Here the way the text and the

annotation are presented can be modified, for example, by adding

handles to the markables, inserting empty lines or structuring

dialogue turns.

-the Customization file (XML), containing a description of how each markable

should be visualised, i.e. foreground and background colour, size



and font aspect, according to its attributes and relations. A

markable that has not yet been assigned attribute values could be

associated e.g. with a different background colour so as to be

easily spotted as requiring the completion of the annotation.

- the Markable directory, containing the annotation in XML format. The annotation

of each article is stored in a separate file, as it was for the Base

Data, while Scheme, Style, Customization are common to the

entire project. This file represents the stand-off annotation and lists

all the markables for the specific level of annotation. Markables are

assigned a unique ID and a reference span, pointing to the original

text, stored in the Base Data, by pointing to the word index (e.g.

span="word_62..word_78"). For each markable attribute values

and relations are specified.
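
A markable file of this kind is easy to resolve against the Base Data, since a span only needs to be expanded into the word indexes it covers. The Python sketch below does this for the example span given above; the function name is hypothetical, and the comma-separated notation for discontinuous markables is an assumption about the span syntax rather than something documented in this chapter.

```python
import re


def parse_span(span):
    """Expand a span such as 'word_62..word_78' into a list of word indexes.

    Comma-separated fragments (assumed syntax for discontinuous markables)
    are expanded one by one and concatenated.
    """
    indexes = []
    for fragment in span.split(","):
        numbers = [int(n) for n in re.findall(r"word_(\d+)", fragment)]
        if not numbers or len(numbers) > 2:
            raise ValueError(f"malformed span fragment: {fragment!r}")
        first, last = numbers[0], numbers[-1]
        indexes.extend(range(first, last + 1))
    return indexes
```

For instance, `parse_span("word_62..word_78")` yields the seventeen indexes from 62 to 78, each of which identifies exactly one `word` element in the Base Data file.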

In a preliminary stage, cue, content and source were defined as three separate

markable levels. This allows keeping the three components completely distinct

during the annotation process and makes it possible to select immediately the role

of each markable when the text span is selected, as shown in Figure T on the right

hand side.

Figure T - The annotation of cue, content and source as separate levels

On the other hand, however, this results in the annotation of each article being

stored in three separate files, one per markable level. As having the annotation

in a single file makes it easier to access and requires less storage space, the

Scheme was changed and the components of attribution were subsequently

annotated on the same markable level as different attributes.

The pilot annotation project consists of an ‘input’ directory containing the 50

articles, tokenised and in XML format, i.e. the Base Data, and the Markable

directory containing 50 corresponding files where the annotation ‘output’ is stored.

Scheme, Customization and Style are instead common and had to be written only

once. For each article a ‘.mmax’ file is also produced, which contains the reference

to the corresponding Base Data file; it is this file that needs to be loaded

when opening the corresponding annotation project with the tool.

5.3.1 Scheme

The Scheme, included in Appendix 1, is the most interesting component of

MMAX2 as it describes the annotation schema and the way it is presented during

the annotation. After selecting a markable on the text, it is possible to assign

attributes to it through the annotation window. This is initially displayed as in Figure

U, where just the role of the markable in the attribution relation can be selected,

‘none’ being the default value and all the possible values being displayed as radio

buttons. Only when the relevant role has been selected, other features are

activated.

The ‘type’ feature clearly belongs to the cue, together with

‘factuality’. The ‘source type’ is an attribute of the source; however, sources are

frequently left implicit, in which case no source markable is available.

So as not to lose the information about the ‘source type’, this attribute can be made

available when selecting the cue. As the cue is the textual anchor of the attribution

relation, it is never missing and can therefore carry information about ‘weaker’

elements.

As far as the ‘scopal change’ is concerned, as the change in the scope

usually involves reversing the polarity or factuality of the content, this could have

been associated with it. Considering however that the element changing scope is

usually included in the ‘cue’ span, e.g. the negation of the attribution verb, and that

it would be easy to forget this relatively infrequent attribute if it were

separate from the other ones, the ‘scopal change’ has also been made available

for selection through the cue.

Figure U - MMAX2 Annotation window

When a markable is defined as the cue, the annotation panel shows also all the

features connected to it (Figure V). The type attribute has by default the value

‘none’. When ‘assertion’, ‘belief’ or ‘eventuality’ is selected, the ‘scopal_change’

feature is also made available. As a change in the scope cannot occur with ‘facts’,

this feature remains disabled for that type in order to facilitate the annotation.

Figure V - MMAX2 Annotation window (attributes)
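
The way attributes are progressively enabled can be summarised with a small Python model. It is not the MMAX2 Scheme file, which is reported in Appendix 1, but a sketch of the dependency logic described in this section; the function name and the exact spelling of the attribute keys are assumptions.

```python
# Default values as described in this section.
DEFAULTS = {
    "role": "none",
    "type": "none",
    "factuality": "factual",
    "source_type": "writer",
}

# Cue types for which a change in scope can occur ('fact' is excluded).
SCOPAL_TYPES = {"assertion", "belief", "eventuality"}


def enabled_attributes(values):
    """Return the names of the attributes the annotator can currently edit."""
    enabled = ["role"]
    if values.get("role") == "cue":
        # Cue features, plus the source type carried by the cue so that
        # it is not lost when the source is implicit.
        enabled += ["type", "factuality", "source_type"]
        if values.get("type") in SCOPAL_TYPES:
            enabled.append("scopal_change")
    return enabled
```

A freshly selected markable thus offers only the role, a cue of type ‘fact’ offers the cue features without ‘scopal_change’, and a cue of type ‘belief’ offers all of them.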



Factuality is by default ‘factual’, as this is by far the more frequent case and thus the

‘unmarked’ value; ‘non-factual’ therefore has to be deliberately selected. The source

is by default ‘writer’. The writer is in fact the shallowest source of any attribution

relation. The annotation scheme also includes a free text slot for the ‘source_ID’

attribute. This was left blank in the pilot; however, it has been included in the tool

Scheme as it represents a highly desirable feature that the final annotation should

possess. The same source can in fact be mentioned in an article in a number of

different ways, e.g. proper name, common noun, profession, pronoun, etc.

This feature is included in the Opinion Corpus (Wiebe, 2002) where it not

only provides the source with a unique ID, assigned by the annotator, but it also

accounts for embedded attributions. In this slot the annotator should list the sources, from the

shallowest, i.e. the first source the writer mentions or the writer itself when

explicit, to the most embedded one, i.e. the one directly holding the content in the

attribution relation.

The ‘source ID’ slot should ideally be redundant. A coreference tool should

be able in the future to automatically and reliably relate pronouns and alternative

full nouns to the original source, just as it should be possible to do for

coreference relations involving the content. It should also be possible to derive the

additional sources to the left of an attribution by identifying text spans containing

the one corresponding to the attribution. Once an attribution relation is included in

the content of another attribution, it should inherit its source. By performing this

task starting from the ‘outside’, a nested attribution would simply inherit all the

‘external sources’ from the attribution immediately above as they would be all

already listed in its ‘source ID’.
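
The inheritance procedure outlined above can be sketched in Python. The data layout, the function name and the example sentence are assumptions made for illustration: each attribution is described by its source, by the word span of the whole relation (‘extent’) and by the word span of its content, and an attribution counts as nested when its extent lies within the content of another one.

```python
def source_chains(attributions):
    """Map each attribution to its sources, from shallowest to most embedded."""
    def contains(outer, inner):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    def size(span):
        return span[1] - span[0]

    chains = {}
    # Proceed from the outside in: attributions with larger contents first.
    ordered = sorted(attributions.items(), key=lambda kv: -size(kv[1]["content"]))
    for name, att in ordered:
        parents = [p for p, other in attributions.items()
                   if p != name and p in chains
                   and contains(other["content"], att["extent"])]
        if parents:
            # Inherit from the attribution immediately above (smallest content).
            parent = min(parents, key=lambda p: size(attributions[p]["content"]))
            chains[name] = chains[parent] + [att["source"]]
        else:
            chains[name] = [att["source"]]
    return chains


# Hypothetical example: "Mary said that John thinks that taxes will rise"
# word indexes:           1    2    3    4    5      6    7     8    9
ATTRIBUTIONS = {
    "outer": {"source": "Mary", "extent": (1, 9), "content": (4, 9)},
    "inner": {"source": "John", "extent": (4, 9), "content": (7, 9)},
}

chains = source_chains(ATTRIBUTIONS)
```

Because the outer attribution is processed first, the inner one only needs to look at the relation immediately above it to obtain the complete list of its external sources.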

Relations are not established in the annotation window, although they are

shown there as the last element (Figure V), but directly in the window displaying the

text: after selecting a markable and right-clicking on the markable it should be

related to, the option ‘add to markable set’ becomes available, as in Figure W.

When an element that is part of a relation is selected (Figure W below), the markables

belonging to the same relation are shown with a grey background and linked by a red

line.

The type of relation adopted for the annotation of attribution is the ‘markable

set’. This allows relating as many markables as required. As the relation is

undirected, it can be retrieved from the annotation of any markable that is part of the

set, and not only from the annotation of the markable from which the relation

originates, as with the ‘markable pointer’ relation. This was especially important as

attribution relations are bidirectional: it is necessary to trace the source from

the content, but also vice versa.

Figure W - MMAX2 Annotation of relations
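
The practical advantage of an undirected set over a pointer can be shown with a minimal Python sketch. The markable ids, the set labels and the function names are hypothetical; the only assumption carried over from the tool is that every member of a set stores the same set label.

```python
from collections import defaultdict

# Hypothetical markables: each one records the set it belongs to.
MARKABLES = [
    {"id": "markable_1", "role": "source", "set": "set_1"},
    {"id": "markable_2", "role": "cue", "set": "set_1"},
    {"id": "markable_3", "role": "content", "set": "set_1"},
    {"id": "markable_4", "role": "cue", "set": "set_2"},
]


def build_sets(markables):
    """Group the markables by the label of the set they belong to."""
    sets = defaultdict(list)
    for m in markables:
        sets[m["set"]].append(m)
    return sets


def related(markable_id, markables):
    """Return the other members of the set the given markable belongs to."""
    by_id = {m["id"]: m for m in markables}
    members = build_sets(markables)[by_id[markable_id]["set"]]
    return [m for m in members if m["id"] != markable_id]
```

Starting from the content (`markable_3`) one reaches the source and the cue, and starting from the source (`markable_1`) one reaches the cue and the content, without any markable being privileged as the origin of the relation.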

5.3.2 Customization

The Customization file was written as reported in Appendix 1. The third line

specifies the display preferences for all the markables. Every selected span is

highlighted by surrounding it with black handles and showing its text in blue, bold

font. The following lines in the file define for each attribute specific display

preferences. This could have been done for every different value of every feature

connected to attribution; however, excessive differentiation, instead of helping

the annotation and its later use by visually characterising different

elements, would simply confuse. It would in fact be necessary to memorise the

association of many colours and font effects to the different features involved in

the annotation.

Only the components of attribution were therefore visually characterised:

once a span is marked as the cue, its background changes to orange; the

content has a cyan background, and the source and the supplement a green

and a light grey one respectively. Apart from allowing the immediate identification

in the text of cue, content, source and supplement, these display settings provide

feedback about the successful annotation of a markable.

In case the annotator forgets to assign a role to a selected markable, or, in

case of uncertainty, intentionally leaves that for a later stage, the display will

continue showing the markable (blue, bold font and handles) without colour

background, thus making it easier to identify it later on when completing the

annotation. Similarly, in case the annotator fails to save the annotation, as it is easy

to forget to select the ‘auto-save’ function every time a different annotation project

is loaded, and even more so to manually save the annotation for every single

markable, this can be immediately noticed. The annotator can therefore select the

‘auto-save’ option and repeat only the last markable selection.

5.3.3 Style

The Style sheet (also reported in Appendix 1) has not been extensively modified. While,

for example, dialogues are surely better displayed by separating turns and

differentiating the actual text from the speaker, news articles have a simpler

structure. Apart from the body text, the other elements are the title(s) and the

author, when explicitly mentioned. However, distinguishing these elements for the

annotation is not necessary. On the other hand, since attribution relations can

often be found nested one in another, adding handles to the right and to the left of

each markable represents the only way to make it possible to identify these

instances (Figure X). Handles were therefore added in the Style sheet.

Figure X - Nested attributions visible through handles



5.4 Feasibility of the Schema and Issues

While performing the pilot annotation several issues arose leading to a

reconsideration of the annotation schema. This was partially modified and

reapplied to the sample corpus. Some changes were determined by the tool

characteristics, in order to better exploit its potential or make up for shortcomings

so that the schema was adequately represented and the annotation process

relatively easy and intuitive. Other issues were brought up by acquiring evidence

of real language occurrences of attribution relations presenting aspects not yet

considered. Finally, doubts and difficulties in applying the schema shed light on

features of the schema requiring further investigations to reach a more appropriate

description. These issues, which have already been analysed in the relevant

chapters, will only be briefly presented here.

First of all, the annotation process highlighted the need to determine more

precisely the scope of the attribution relation, i.e. the text span to select

as source, cue or content. Adverbs, relative clauses, appositives or other elements

can represent either highly informative material contributing to the interpretation of

the attribution or disruptive additional information which would be better left out of

the annotation. In addition, through the annotation it was possible to realise the

necessity of a solution to preserve the information carried by ‘source of the source’

elements (3.2.2), namely the provenance of the knowledge acquired through verbs

of the ‘fact’ type (e.g. John knows FROM MARY, that…) and recipients of messages

presupposing a perlocutionary act i.e. the indirect object of eventualities

(especially influence verbs, e.g. The pope prohibits CATHOLICS to… ). In order to

account for these elements which do not however correspond to any of the three

components of an attribution relation, a ‘supplement’ role was added and included

in the annotation.

Moreover, it was necessary to account for instances of multiple sources

belonging to different types. As all attributes have been added on the cue, i.e.

associated to the text span corresponding to the cue, it is not possible to give

different values for the ‘source type’ feature. This would instead have been possible

by marking the ‘type’ directly on the source, thereby assigning a type to each

source in the attribution relation. However, the null or hidden sources would

then have been problematic as they have no corresponding text span. Hidden sources



are a lot more frequent than multiple sources belonging to different source types

and therefore the former issue was given priority. The solution adopted was that of

including a value ‘mixed’ for the ‘source type’ attribute. In addition, the frequency

of coreference relations involving the source led to the addition of a ‘source ID’

attribute as described in (3.2.1, 5.3.1).

Lastly, assigning a value to the features ‘type’ (4.1) and ‘scopal change’

(4.4) turned out to be uncertain in some cases and dependent on subjective

considerations. Hence the need for a statistical analysis of the problem,

comparing inter-annotator agreement on these features, in order to estimate its

extent and introduce the appropriate changes to the schema if required.
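The agreement analysis mentioned above could be sketched as follows. The text does not state which agreement measure would be used; Cohen's kappa is assumed here only because it is a common choice for two annotators assigning categorical values. The function name and the sample labels are hypothetical and do not come from the pilot.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 'type' values assigned by two annotators to the same six cues.
annotator_1 = ["assertion", "belief", "assertion", "fact", "eventuality", "belief"]
annotator_2 = ["assertion", "assertion", "assertion", "fact", "eventuality", "belief"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

The same function could be applied separately to the ‘type’ and ‘scopal change’ values, so that the feature causing more disagreement can be identified.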

5.5 Summary

In order to develop an annotation schema for the phenomenon considered in this

study and test its efficacy, it was decided to develop a pilot annotation. The pilot

was performed on a balanced portion of the ISST corpus consisting of 50 articles.

The annotation was carried out with the help of an annotation tool.

In order to select the most suitable tool, specifications for the proposed

annotation schema were listed and compared with the available software. Some

requirements, such as the possibility to select discontinuous text spans for a single

markable and to relate markables through relations, were given

priority and tools not meeting them were discarded. The two remaining tools,

Knowtator and MMAX2, were compared with respect to the additional

requirements, e.g. their customisability, how the annotation is saved and user-friendliness.

MMAX2 was eventually adopted and set for the annotation. This required

delineating the annotation ‘Scheme’, i.e. how to organise cue, source and content

as well as their attributes and constraints, as well as defining ‘Style’ and

‘Customisation’. The articles had to be prepared, that is corrected and in raw text,

to become the XML ‘Base Data’ of the annotation. The annotation of each article

was stored stand-off in a single XML file with reference to the word index.
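The idea of stand-off storage with reference to the word index can be illustrated with a minimal sketch. The word identifiers, the dictionary layout and the span notation (ranges written with ‘..’, discontinuous parts separated by commas) are assumptions made for illustration only; they are not a specification of the files produced by the tool.

```python
import re

# Hypothetical base data: word identifiers mapped to tokens.
words = {
    "word_1": "Il", "word_2": "ministro", "word_3": "del", "word_4": "Tesoro",
    "word_5": "ha", "word_6": "indicato", "word_7": "anche",
}


def expand_span(span):
    """Expand a span such as 'word_1..word_4,word_7' into a list of word ids."""
    ids = []
    for part in span.split(","):
        bounds = part.strip().split("..")
        first = int(re.fullmatch(r"word_(\d+)", bounds[0]).group(1))
        last = int(re.fullmatch(r"word_(\d+)", bounds[-1]).group(1))
        ids.extend(f"word_{i}" for i in range(first, last + 1))
    return ids


def span_text(span, base):
    """Recover the text of a markable from the base data."""
    return " ".join(base[word_id] for word_id in expand_span(span))


# Hypothetical stand-off markables: they hold only references, never the text itself.
markables = [
    {"id": "markable_1", "role": "source", "span": "word_1..word_4"},
    {"id": "markable_2", "role": "cue", "span": "word_5..word_7"},
]
for markable in markables:
    print(markable["role"], "->", span_text(markable["span"], words))
```

Since the markables only point to word identifiers, the base data remains untouched and discontinuous spans can be expressed without duplicating text.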

The pilot made it possible to identify some issues by testing the annotation

schema against real language instances. These, together with the constraints



determined by the tool characteristics resulted in the partial modification of the

annotation schema, in order to account for example for the problem of co-

reference resolution, and phenomena such as ‘sources of the source’ and ‘mixed

sources’. Although some modifications might still be required, e.g. ‘type’ and

‘scopal change’ features, once the applicability of the schema has been

statistically evaluated, the annotation scheme developed so far proved to be

feasible and the annotation, with the help of the tool and of annotation constraints,

rather reliable although at times problematic.

6 Annotation Schema and Guidelines



The annotation process starts with loading one article at a time in the MMAX2 tool

and is generally performed in five phases. First of all it is necessary to identify the

presence of an attribution relation. This is usually done starting from the

identification of an attribution cue, typically punctuation marks and reportive verbs.

However, for the relation to be annotated, it is not enough to find elements linked

by the cue which function as source and content.

The content should in fact express the object of the attribution and not just

its description. An attribution like ‘John said two words’ is not relevant (while ‘John

said: “two words”’ would be), unless it is necessary in order to relate ‘John’ to the

actual two words he pronounced which can be expressed somewhere else in the

article, similarly to coreferential pronouns functioning as content.

Also idiomatic or ‘false’ attributions ( (137), (138), (139)) should not be

annotated. These attributions in fact are not meant to establish a relation between

source and content. The source of idiomatic attributions is also generally hidden.

Examples (137) and (138) represent a specification or a concession with respect

to what was previously said. In (139) the reportive verb ‘say’ is just employed to

express an equivalence (Biedermeier = “il buon Meier”).

(137) C'È DA DIRE CHE d'Arminio Monforte non sarà scelto dall'intraprendente El

Sayed ad interpretare ed eseguire le nuove strategie che dovranno portare

a così alti traguardi. (ISST els001)

IT SHOULD BE SAID THAT Arminio Monforte won’t be chosen by the

enterprising El Sayed to interpret and execute the new strategies that will

have to bring to so high achievements.

(138) Perché VA DETTO CHE il signor B. spesso ha una casa fuori porta, in mezzo

al verde. (ISST perod001)

Because IT HAS TO BE SAID THAT mister B. often has a house out of town, in

the countryside.



(139) Biedermeier, COME DIRE "il buon Meier": il cittadino medio del secolo scorso,

protagonista di un'epoca, un gusto, uno stile. (ISST perod001)

Biedermeier, AS TO SAY “the good Meier”: last century average citizen,

protagonist of an epoch, a taste, a style.

In the second phase, after having identified an attribution relation, the relevant text

spans need to be selected and labelled as markables. They are then displayed

as blue bold text in between square brackets. Afterwards, a role (Figure Y) is

assigned to each markable, which is then shown with a specific colour

background. The next step consists in assigning values to each attribute

in the annotation. Lastly, the markables need to be linked in a relation. This can be

done by selecting a markable and right-clicking on the elements which should be

included in the same set. When a markable in a relation is selected, the markables

that are part of that relation set are displayed joined by red arcs.

Figure Y - Attribution relation components

In this chapter the annotation schema developed in this thesis will be summarised

and presented as it has been employed in the pilot annotation. Indications will be

provided regarding the selection of the relevant text span for each of the

constitutive elements of the attribution relation. With the use of examples from the

corpus, instructions concerning how to assign the values for each annotated

feature will also be given. All the recommendations reported in the following

sections, however, have to be regarded as suggestions, a reference repository of

good practice examples with the aim of facilitating the annotation process, rather

than prescriptions. The context and a full awareness of the goals to achieve

should alone be sufficient to reliably drive the annotation. Should this prove

incorrect, the strategy adopted here should be abandoned in favour of a more

controlled one.

[Figure Y: SOURCE(S), CUE and CONTENT(S) joined in a relation, with an optional (SUPPLEMENT)]



6.1 Text Spans Selection

Once an attribution relation is found, it is necessary first of all to identify its

constitutive elements (Figure Z) and determine which span represents them. Each

relation has three basic components: the cue, i.e. the textual anchor

signalling the relation; the content, that is the attributed material; and the source,

the entity the content is attributed to. The source can be missing as it is sometimes

left implicit. It should however be clear when annotating which implicit entity the

attribution refers to. In some cases it is instead possible to have multiple instances

of ‘source’ and ‘content’. In addition to these three components there is a fourth

one, the ‘supplement’, which can be optionally used to mark additional relevant

information.
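The components just described can be pictured as a small data structure. This is only an illustrative sketch: the class and field names are hypothetical, and the checks encode nothing more than what is stated above (one cue, at least one content, sources possibly missing, supplements optional).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Markable:
    role: str  # 'source', 'cue', 'content' or 'supplement'
    text: str


@dataclass
class AttributionRelation:
    cue: Markable
    contents: List[Markable]
    sources: List[Markable] = field(default_factory=list)      # empty when the source is implicit
    supplements: List[Markable] = field(default_factory=list)  # optional additional information

    def problems(self):
        """Return a list of structural problems; empty when the relation is well formed."""
        found = []
        if self.cue.role != "cue":
            found.append("the cue markable must have the role 'cue'")
        if not self.contents:
            found.append("at least one content markable is required")
        for expected, group in (("content", self.contents),
                                ("source", self.sources),
                                ("supplement", self.supplements)):
            for markable in group:
                if markable.role != expected:
                    found.append(f"'{markable.text}' should have the role '{expected}'")
        return found


# A relation with an explicit source, using spans from the corpus.
explicit = AttributionRelation(
    cue=Markable("cue", "ha indicato anche"),
    contents=[Markable("content", "l'obiettivo del prossimo anno: 4 per cento")],
    sources=[Markable("source", "Il ministro del Tesoro")],
)
# A relation whose source is implicit: no source markable is created.
implicit = AttributionRelation(
    cue=Markable("cue", "Ha anche aggiunto"),
    contents=[Markable("content", "che i risultati positivi derivano ...")],
)
print(explicit.problems(), implicit.problems())
```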

Figure Z - Annotation, text spans selection.

The text spans corresponding to cue, source and content should be first selected

(Figure Z) thus enabling the option of creating a markable with the selected text. In

case extensions or reductions to the text span corresponding to a markable are

required, it is possible to do so by choosing ‘add’ or ‘remove from this markable’

from the menu on the selected span.



Elements that can possibly constitute each markable type are listed in Figure AA.

Deciding what is in the scope of the attribution relation, i.e. what exactly to

include in each markable, should not be taken for granted. In the following

sections indications will be provided about each markable type and what should

be included or left out of its text span.

Figure AA - Annotation, elements which could function as a markable.

6.1.1 Source Span

In general, the source span should include all those elements relevant to

the identification of the entity having this role. However, what is to be considered

relevant needs to be defined. The source should always comprise the full noun

phrase expressing it ( (140) attribution 1) or, in case the source is represented by

an adjective or a prepositional phrase (141), these elements have to be included.

(140) [Il ministro del Tesoro]1 [ha indicato anche] 1 [l'obiettivo del prossimo anno:

4 per cento]1. Ø [Ha anche aggiunto] 2 [che i risultati positivi derivano

soprattutto dalla caduta dei prezzi del petrolio, da quello delle altre materie

prime e dal calo del dollaro] 2. (els020)

[The Secretary of the Treasury] 1 [has also indicated] 1 [next year goal: 4

per cent] 1. (He) [has also added] 2 [that the positive results mainly derive

from the drop of the petrol price, from that of other raw materials and the

dollar decrease] 2.

[Figure AA: elements which can function as each markable type, joined in a relation]

SOURCE(S): noun phrase, adjective, prepositional phrase

CUE: verb, noun, adjective, preposition, prepositional group, graphic marker

CONTENT(S): word, phrase, clause, sentence, entire article

(SUPPLEMENT): cue modifier, indirect object, source of the source, event specification



(141) Le parole registrate di Gheddafi, …(ISST cs039)

Gheddafi’s recorded words,…

In case of appositives or relative clauses referring to the entity in the noun phrase

and contributing to its characterisation, these should also be selected together with

the noun phrase as in the example (142). When they instead digress from the task

of identifying the scope as in (143) and constitute a mere description or provide

additional details which are not necessary, they should not be annotated.

(142) …il presidente della casa giapponese, Osamu Suzuki, ha previsto

un'ulteriore flessione dei profitti anche per quest'anno. (ISST sole100)

…the president of the Japanese company, Osamu Suzuki, has predicted an

additional fall of the revenues also for this year.

(143) <<Un'idea geniale>> l'ha definita Cesare Verlucca, editore piemontese

pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone.

(ISST sole040)

<<A genius idea>> has defined it Cesare Verlucca, publisher from

Piedmont ready to face the ‘salotti’ after the sale success obtained at the

‘Salone’ fair.

When the relation is part of a relative clause with the source expressed by a

relative pronoun, just the pronoun should be annotated as in (144). The full noun

phrase the relative pronoun refers to, in this case ‘Milan vice-president, Galliani’, should

be syntactically retrievable; moreover, it will be reported in the ‘source ID’ slot. Null

or missing subjects, having no corresponding span, should not be marked on the

text ( (140) attribution 2).

(144) Una provocazione collegata a un recente colloquio con il vicepresidente del

Milan, Galliani, il quale ha convenuto con me circa l’insostenibilità della

situazione. (ISST cs064)

A provocation connected to a recent conversation with Milan vice-president,

Galliani, who agreed with me about the situation being unbearable.



6.1.2 Cue Span

The cue can be expressed by a considerable number of elements, thus making it

difficult to recognise automatically. Most commonly, however, cues are reportive

verbs. Apart from the particle or expression reversing its polarity (145),

adverbs ( (146), (147)) modifying the attitude should also be included, while

complements or specifications, which should be considered only in case they

provide relevant information, can be included in the supplement.

(145) Ø Non ho mai pensato che <<Il Dottor Zhivago>> potesse essere

considerato un’opera ostile al socialismo. (ISST els076)

(I) have never thought that <<Doctor Zhivago>> could be considered a work

against socialism.

(146) Afferma ufficialmente l’Antitrust : <<Le modalità di pubblicizzazione del

prezzo consigliato…>>. (ISST sole049)

The Antitrust officially affirms: <<The advertisement modalities of the

suggested price…>>.

(147) Ieri sera i segretari generali hanno esplicitamente detto di essere

d’accordo con una delicata proposta contenuta nel documento dei giuristi

sulle sanzioni da applicare ai singoli lavoratori che si rifiutassero di prestare

il lavoro richiesto per garantire il minimo di servizio. (ISST els079)

Yesterday the general secretaries explicitly said that they agree with a

delicate proposal included in the lawyers’ document concerning the

penalties to inflict to the individual workers who would refuse to give the

work required to guarantee the minimum service.

Similarly ‘cue of the cue’ particles, i.e. usually complements expressing means or

provenance, should not be labelled as ‘cue’ but included in the annotation as

‘supplement’, together with the indirect object, when expressed. In this way these

elements and the information they carry would be retrievable.

When more than one cue belonging to the same type, i.e. conveying the

same attitude the source holds, is expressed, these should all be included in a



single ‘cue’ markable, since each ‘cue’ markable corresponds to one

attribution relation. Cases like (148) with a redundant cue and source (since

Martino and the Foreign Secretary are the same person) are also possible. In this

case the co-referential sources should be included in the same source markable,

enclosed in the square brackets labelled as ‘s’. Similarly, the two cues will

constitute a single cue markable; the corresponding text span is marked with ‘c’.

While cues of different types should be split into separate attribution relations,

those of the same type concur in signalling the presence of an attribution and

should be grouped. An exception is made only for punctuation cues, which should

be annotated only when the relation is not signalled by any other means, as in (149).

(148) [Secondo]c [il ministro degli Esteri]s, [la prossima ondata di ottimismo ci

sarà], [ha detto]c [Martino]s, [quando comunicheremo le nostre prime

iniziative concrete]. (ISST sole017)

[According to]c [the Foreign Secretary]s, [the next wave of optimism will

take place], [said]c [Martino]s, [when we will announce our first concrete

initiatives].

(149) Il Papa: “La cultura ha bisogno del genio femminile”. (ISST cs014)

The Pope: “Culture needs the female genius”.

Lastly, when the attribution is a question, the cue should include the element

giving the utterance the interrogative form, i.e. the question mark in case of direct

questions (150). This element should not be included in the content: it is in fact the

cue, and therefore the attribution itself, which is questioned; the content of the

attribution is not itself a question.

(150) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo

scrittore si trovasse a una svolta esistenziale? (ISST els034)

Do you also think like many literary critics that, with his unfinished novel,

the writer was at an existential turning-point?



6.1.3 Content Span

The selection of the content should obey the principle of limiting the annotation to

that portion of text which is surely meant to be attributed to the source. This means

that the content span should not include utterances of uncertain attribution due to

syntactic ambiguities. An example is when a clause constituting the content is

joined to another utterance via a coordinating conjunction. In this case, only if the

complementizer ‘che’ (that) is included ( (151), (152)) is the second clause also

surely attributed; otherwise it could represent material added by the source above,

usually the writer.

(151) Più positive, invece, il giudizio di Fim-Cisl e Uilm-Uil, che hanno annunciato

per oggi una conferenza stampa e che sono favorevoli ad una votazione

referendaria sulla bozza di accordo. (ISST els002)

More positive, instead, the opinion of Fim-Cisl and Uilm-Uil, that have

announced a press release for today and that they are positive about a

referendum poll concerning the agreement draft.

(152) Lo ha detto ieri un portavoce del ministero degli Esteri, il quale ha anche

annunciato che il governo cinese ha protestato con quello degli Stati Uniti e

che si riserva il diritto di ulteriori reazioni. (ISST els075)

It was said yesterday by a spokesman of the Foreign Ministry, who has also

announced that the Chinese government has complained to the one of the

United States and that they reserve themselves the right of further

reactions.

The IO of verbs requiring one, e.g. to order, to forbid (153), should also be part

of the content span. In the example below the prohibition would in fact be incomplete

without the IO to which it is addressed. ‘Zagreb authorities’ did not prohibit ‘to go to

Petrinja’ (this could even be considered an incorrect attribution), but they ‘forbade

journalists to go to Petrinja’.

(153) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e

nelle altre località appena riconquistate. (ISST cs030)



And Zagreb authorities have forbidden journalists to go to Petrinja and the

other just reconquered places.

When the content span is separated by an incidental phrase or clause, it should be

annotated as a single markable, unless, as in (154), the content is also divided by

sentence boundaries. In this case it seems more appropriate to add the

second part of the attribution to the same relation, though as a second content

markable.

(154) "There's no question that some of those workers and managers contracted

asbestos-related diseases," said Darrell Phillips, vice president of

human resources for Hollingsworth & Vose. "But you have to recognize

that these events took place 35 years ago. It has no bearing on our work

force today." (PDTB 0003)

The complementizer ‘that’ should always be included in the content span, together

with the quotation marks (155) (i.e. “…” or <<…>>). When source and cue are

expressed incidentally, surrounded by hyphens, these should also be included in

the content (155).

(155) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction ?

["Sì -] [si adombra] [Matt] [- Un ruolo interessante: con Tarantino eravamo a

buon punto, poi é arrivato Bruce. I suoi film incassano un po' più dei miei,

no? Hanno scelto lui"]…(ISST cs060)

Is it right that you were going to play the role of Bruce Willis in Pulp Fiction?

[“Yes -] [Matt] [grows dark] [- An interesting role: with Tarantino we were at

a good point, then Bruce arrived. His films cash in a bit more than mine,

right? They chose him”]…

Punctuation at the end of a content span should only be included if part of the

content itself. This means that for example a full stop at the end should be

included when the content is expressed by a full sentence, a question mark when

the content itself is a question (156) and so forth.



(156) Ø Sospende il racconto e formula una domanda, in inglese: “Sai cos’è un

rabbit?”. (ISST cs030)

(He) holds the narration and poses a question, in English: “Do you know

what a rabbit is?”.

6.1.4 Supplement

The supplement span is a useful device to account for optional additional

elements which, although not fundamental in an attribution relation (they are in fact

often missing), do carry useful information. These can be: elements concurring in the

identification of the source and of the provenance (157) or means by which the

information was acquired; further specifications of the attitude the source holds;

the recipient of a reportive verb of the assertion type (e.g. to tell); and event

specifications providing context indications essential to the interpretation and

comprehension of the content. The latter also include instances like (158), where

the content has been asserted or expressed about a certain entity or event (‘it’). In

the example this element ‘it’ is necessary as required by the verb and in case of an

indirect quotation it could be included in the content. In this case however, it is not

directly part of it as the source has been talking about this event without

mentioning it. In the examples below the supplement span is in small capitals.

(157) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro,

non pensavo di arrivare primo. (ISST cs071)

(I) heard of the disqualification of Garciano FROM MAURIZIO DAMILANO, I

swear, I didn’t imagine I would have come first.

(158) <<Un'idea geniale>> L'ha definita Cesare Verlucca, editore piemontese

pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone.

(ISST sole040)

<<A genius idea>> has defined IT Cesare Verlucca, publisher from

Piedmont ready to face the ‘salotti’ after the sale success obtained at the

‘Salone’ fair.



6.2 Feature Annotation Guidelines

After selecting the text spans corresponding to the elements part of an attribution

relation it is necessary to assign the role to each markable in the ‘annotation

window’. When the role ‘cue’ is chosen, the window ( Figure BB) will display also

the attributes and their values which need to be assigned.

Figure BB - Annotation, attributes selection.

The features included in the attribution are summarised in Table 4. They are all

marked on the cue, although some refer to characteristics of the source, i.e.

‘source type’ and ‘source ID’.

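Role         Attribute       Values
Cue          Type            None, Assertion, Belief, Fact, Eventuality
             Factuality      Factual, Non-factual
             Scopal change   None, Scopal change
             Source type     Writer, Other, Arbitrary, Mixed
             Source ID       free text
Source       -               -
Content      -               -
Supplement   -               -

Default values (underlined in the original table): Type = None, Factuality = Factual, Scopal change = None, Source type = Writer.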

Table 4 - Annotation schema features



It was decided to proceed this way as it prevents a loss of information, or

the necessity to add dummy elements, in case the source is implicit and has no

corresponding text span to which the annotation could be anchored. The feature

‘scopal change’ is disabled when cues of the type ‘fact’ are selected. The

underlined values represent the default values.

6.2.1 Type Attribute

The type of attitude held by the source is by default ‘None’. In the annotation

window however, one of the four values this feature can assume, namely

assertion, belief, fact and eventuality, needs to be selected. For a more detailed

analysis of the issues involved in the selection of the type, see (4.1.5). Here some

strategies are presented which have been adopted in the pilot.

A direct quotation should always be marked as ‘assertion’ even when the

punctuation is not the only cue and other cues express a different attitude as in

(159). The prepositions ‘per’ (for) and ‘secondo’ (according to) have also been

considered assertions, together with prepositional groups such as ‘stando a’

(according to), ‘a detta di’ (according to), and so forth. Other prepositional groups,

e.g. ‘a parere di’ (160) (in the opinion of), ‘agli occhi di’ (in the eyes of), ‘nell’ottica

di’ (in the perspective of) have instead been marked as ‘belief’.


(159) […] (ISST cs030)


(160) A suo parere, una particolare follia segnerebbe la continuità della

letteratura siciliana. (ISST els034)

In his opinion, a special folly marks(QUOT.COND.) the continuity of Sicilian

literature.

Verb cues need instead to be considered in context and annotated according to

the attitude they express. A first effort to collect Italian cues and identify their type

is presented in (6.3).
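The strategies adopted in the pilot for non-verbal cues can be summarised as a simple lookup. This is a sketch for illustration: the table contains only the prepositions and prepositional groups listed above, the function name is hypothetical, and verb cues deliberately receive no suggestion since they have to be judged in context.

```python
# Types assigned in the pilot to prepositions and prepositional groups.
PREPOSITIONAL_CUE_TYPES = {
    "per": "assertion",
    "secondo": "assertion",
    "stando a": "assertion",
    "a detta di": "assertion",
    "a parere di": "belief",
    "agli occhi di": "belief",
    "nell'ottica di": "belief",
}


def suggest_type(cue, is_direct_quotation=False):
    """Suggest a 'type' value, or None when the cue must be judged in context."""
    if is_direct_quotation:
        # Direct quotations are always marked as assertions, whatever the other cues express.
        return "assertion"
    normalised = cue.strip().lower().replace("\u2019", "'")
    return PREPOSITIONAL_CUE_TYPES.get(normalised)


print(suggest_type("Secondo"))      # prepositional cue listed above
print(suggest_type("a parere di"))  # prepositional group listed above
print(suggest_type("pensare"))      # verb cue: no a priori type
```

A suggestion of this kind could at most pre-fill the attribute; the annotator would still have to confirm it.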



6.2.2 Factuality Attribute

The factuality attribute takes just two values: factual and non-factual. In order to

decide which value to assign, it is necessary to concentrate on the attribution

relation itself no matter what the content is. ‘Factual’ is the value assigned by

default; it is in fact more frequent, at least in journalistic texts, and represents

real attributions. In case the attribution relation is not a real bond but just a

hypothetical match or the negation of a link between source and content, it takes

the value ‘non-factual’. To summarise the analysis in (4.3.2), the factuality can be

compromised by the following elements when they take scope over the cue:

• polarity reversing particle (negation, negative pronouns) (161)

• verb mood (conditional, imperative)

• verb tense (future)

• hypothetical (if)

• interrogative form (162)

• modals

(161) Nessuno parla più di baratro imminente e di crisi finanziaria. (ISST cs025)

No one is talking anymore about imminent precipice and financial crisis.

(162) Ø Ti dico una cosa: Ø sai qual è il nostro gioco preferito quando partiamo

per qualche operazione militare? (ISST cs030)

(I) tell you something: do (you) know what’s our favourite game when we

leave for some military operation?

The factuality judgement represents the answer to the following question: is the

content presented as attributed to the source in the real world? However, factuality

should be kept separate from evidentiality, i.e. from elements like the quotative

conditional, or ‘sembra/pare’ (it seems), which are employed by the outer source to

express that he or she has no direct evidence of the reported attribution. Although

evidentiality can often be perceived as affecting the epistemic modality, thus

reflecting a lower degree of certainty about the fact that the attribution really took

place, this alone should not be considered enough to reverse the factuality.



6.2.3 Scopal Change Attribute

The scopal change attribute can also take two values, ‘none’ being the default one,

and ‘scopal change’ the other. A change in the scope happens relatively seldom;

however, it is important to recognise it in order to avoid incorrectly considering it as

affecting the factuality. The scopal change occurs almost solely with polarity;

it is therefore advisable to pay particular attention to those attributions appearing

at first as non-factual because of the cue being in the scope of a negation. In these

cases it can be checked whether there is a polarity change, first by determining whether

there is still a perceived attribution and secondly by considering whether the reverse of

the content is attributed. Both requirements are satisfied in (163).

(163) Qualunque sia il numero di sfollati, il governo croato nega che siano stati

espulsi e nega anche qualsiasi volontà di pulizia etnica nelle regioni appena

riconquistate. (ISST cs031)

Whatever the number of evacuees, the Croatian government denies that

they have been banned and denies also any ethnic cleansing intention in

the newly conquered areas.

The case when just the first requirement is not satisfied corresponds to a ‘non-

factual’ attribution. The way factuality and scopal change are intertwined is shown

in Table 5, where values are assigned to these features according to the

intersection of the above mentioned requirements, i.e. a perceived attribution and

the reverse of the content being the intended attributed material.

Factuality + Scopal change    Attribution                    No attribution
Content                       Factual + None                 Non-factual + None
Content reverse               Factual + Scopal change        X
Cue reverse                   Non-factual + Scopal change    X

Table 5 - Factuality and Scopal change values assignment

When just the first requirement is satisfied, that is when there is a perceived intent

of attributing something which however does not correspond to the content or its

reverse as in (164), the annotation should mark the attribution as non-factual and

having a scopal change. It could in fact be considered as a factual attribution of a



negative attitude, in this case ‘not believing’; however, the attribution of the positive

attitude is not expressed. By marking these instances as ‘non-factual + scopal

change’, they are given a unique combination of attributes making them retrievable and

easily distinguishable from simple ‘non-factual’ attributions. The ‘scopal change’

refers to the fact that it is not the polarity of the attribution that is affected, nor that

of the content, but the polarity of the attitude held by the source.

(164) Le opposizioni non credono alla rinascita del tripartito ed insistono nella

richiesta di autoscioglimento. (ISST els038)

The oppositions do not believe in the re-birth of the three-party and insist

in asking its self-dissolution.
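The value assignment of Table 5 can also be written as a small decision function. This is a sketch under the reading of the table given in this section; the function and argument names are hypothetical, and the two judgements it takes as input (whether an attribution is perceived, and what is actually attributed) remain the annotator's task.

```python
def factuality_and_scopal_change(attribution_perceived, attributed_material):
    """Return the (factuality, scopal change) pair for one attribution.

    attributed_material is 'content', 'content_reverse' or 'cue_reverse'.
    """
    table = {
        (True, "content"): ("factual", "none"),
        (False, "content"): ("non-factual", "none"),
        (True, "content_reverse"): ("factual", "scopal change"),
        (True, "cue_reverse"): ("non-factual", "scopal change"),
    }
    try:
        return table[(attribution_perceived, attributed_material)]
    except KeyError:
        # Combinations marked 'X' in Table 5, or unknown values.
        raise ValueError("combination not foreseen by Table 5") from None


# (163) 'nega che siano stati espulsi': the reverse of the content is attributed.
print(factuality_and_scopal_change(True, "content_reverse"))
# (164) 'non credono alla rinascita': the polarity of the attitude is reversed.
print(factuality_and_scopal_change(True, "cue_reverse"))
```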

6.2.4 Source Type Attribute

Assigning a value to the ‘source type’ attribute is not a particularly complex task.

The source is by default ‘writer’ and can also assume the values other, arbitrary

and mixed. ‘Writer’ should be assigned in case the attribution is overtly to the

writer of the article, while ‘other’ refers to another defined entity, including very

general sources like ‘a man’ or ‘experts’. Instances without a specific source, i.e.

impersonal or hidden sources such as ‘everyone’, ‘the people’, ‘one’ or pronouns

like ‘you’ or ‘they’ when used as impersonals, should be marked as ‘arbitrary’.

‘Mixed’ should instead be used when an attribution

possesses multiple sources of different types as in (165).

(165) Tutti, incluse le autorità, conoscono la loro provenienza, ma nessuno dice

e fa nulla per prevenire il massacro di capi selvatici. (cs.morph020)

Everyone, including the authorities, knows their provenance, but no one

says or does anything to prevent the massacre of wild animals.
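
The way the ‘mixed’ value follows from the types of the individual sources can be sketched as below. The function `combine_source_types` is a hypothetical helper introduced only for illustration, and reading ‘Tutti’ as arbitrary and ‘le autorità’ as other in (165) is an assumption based on the definitions above.

```python
from typing import Iterable

SINGLE_SOURCE_TYPES = {"Writer", "Other", "Arbitrary"}


def combine_source_types(types: Iterable[str]) -> str:
    """Return the 'source type' value for an attribution with one or more sources.

    Each element is the type of a single source. With no explicit source the
    default 'Writer' is returned; sources of different types yield 'Mixed'.
    """
    distinct = set(types)
    unknown = distinct - SINGLE_SOURCE_TYPES
    if unknown:
        raise ValueError(f"unknown source type(s): {sorted(unknown)}")
    if not distinct:
        return "Writer"
    if len(distinct) == 1:
        return distinct.pop()
    return "Mixed"


# Example (165): an impersonal source together with a defined entity.
print(combine_source_types(["Arbitrary", "Other"]))  # Mixed
```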

6.3 Collecting a List of Italian Cues

A list of Italian cues classified according to their type would make it possible,

for example, to annotate attribution as was done for the PDTB, by searching for

each cue, one at a time, throughout the whole corpus. In addition, this list would

represent a database which could guide the annotators in their task. Moreover, a

collection of all the possible cues would also help the automatic identification of

attribution relations, providing tools with a lexical anchor to look for. Unfortunately,

collecting a complete list is not feasible for a number of reasons, and such a list

therefore cannot be a reliable instrument.
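
A cue-by-cue search of this kind could be sketched as follows. The sketch assumes that the corpus is available as lemmatised sentences; the data, the function name `find_cue_candidates` and the tiny cue list are hypothetical, and every hit is only a candidate that an annotator still has to check in context.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of a cue list: lemma -> suggested type(s).
CUE_LEMMAS: Dict[str, List[str]] = {
    "dire": ["Assertion"],
    "credere": ["Belief"],
    "sapere": ["Fact"],
    "volere": ["Eventuality"],
}

# A sentence as (token, lemma) pairs; lemmatisation is assumed to be given.
Sentence = List[Tuple[str, str]]


def find_cue_candidates(sentences: List[Sentence], cue: str) -> List[Tuple[int, int]]:
    """Return (sentence index, token index) for every occurrence of one cue lemma."""
    hits = []
    for s_idx, sentence in enumerate(sentences):
        for t_idx, (_token, lemma) in enumerate(sentence):
            if lemma == cue:
                hits.append((s_idx, t_idx))
    return hits


# Toy corpus with hand-written lemmas.
corpus = [
    [("Il", "il"), ("ministro", "ministro"), ("dice", "dire"), ("che", "che"),
     ("le", "il"), ("tasse", "tassa"), ("saliranno", "salire")],
    [("Le", "il"), ("opposizioni", "opposizione"), ("non", "non"),
     ("credono", "credere"), ("alla", "a"), ("rinascita", "rinascita")],
]

# Look for each cue, one at a time, throughout the whole (toy) corpus.
for cue in CUE_LEMMAS:
    for s_idx, t_idx in find_cue_candidates(corpus, cue):
        print(cue, CUE_LEMMAS[cue], "-> sentence", s_idx, "token", t_idx)
```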

First of all, only the punctuation sequence colon-quotation mark almost

certainly corresponds to an attribution. All other lexical and grammatical categories

can assume the function of cue, however this is not always, and in some cases

only occasionally, the case. Secondly, although prepositions and prepositional

groups, together with punctuation, represent a closed class, verbs, adjectives and

nouns surely do not, and it is problematic to determine and list all those

possibly functioning as an attribution device. Lastly, the most substantial class of

attribution cues, the verb, cannot be assigned a type a priori. Many verbs are

polysemous and cover meanings which belong to different attribution types (see

4.1.5). The disambiguation can only rely on the context.

The generic verb ‘fare’ (to do) can also be used as reportive (Renzi, 1995),

although it is quite colloquial and limited to introducing reported direct speech as

for example ‘Mario mi fa: “Come stai?”’ (Mario goes: “How are you?”). More often it

is combined in an expression, e.g. ‘fare il punto’ (to define/clarify), ‘fare notare’

(166) (to point out), ‘fare riferimento’ (167) (to make reference), ‘fare il nome’ (168)

(to mention). This verb cannot be considered as a reliable indicator of an

attribution and is so frequent that it is not feasible to check every instance of it to

ensure it does not entail an attribution relation.

(166) A seguito dell’operazione, l’azionariato della Cementeria di Merone – fa

notare ancora il comunicato diffuso ieri – rimane invariato; (ISST sole103)

Following the transaction, the shareholding of Cementeria di Merone –

the announcement released yesterday further points out – remains unchanged;

(167) Sanpaolo fa riferimento al <<prezzo già concordato con Sasea di 10

miliardi>>… (ISST sole113)



Sanpaolo makes reference to the <<price already agreed with Sasea of 10

billion>>…

(168) …sarebbero interessati acquirenti stranieri e si fa il nome di Bouygues).

(ISST sole117)

…foreign buyers are(QUOT.COND) interested and the name of Bouygues is

mentioned).

Reaching a satisfactory description of possible cues and their types seems at

present, if not unfeasible, quite unlikely to succeed. Nonetheless, a first effort to

collect a partial list of cues was made. Italian cues were partly taken from Renzi

(1995) and Knott (1996). To this first group, including some prepositions,

prepositional groups and verbs, were added cues found in the corpus and

deverbal noun cues derived from the listed verbs. In order to enlarge the list of

reportive verbs, English verbal cues were extracted from the PDTB.

The list of Italian cues is reported in Appendix 2. The inventory reports the

Italian cue, followed by its English equivalent in italics and, when possible, the

overall number of occurrences of the lemma in the corpus. In case of multi-word

expressions or generic verbs, this figure was not retrieved and it is signalled with

an ‘N’. Cues are classified according to their class, i.e. verbs, nouns, prepositions,

prepositional groups, grammatical markers, punctuation. Verbs have also been

classified according to their type. Some polysemous verbs are reported in more

than one type group. Their classification, however, is just a suggestion as the

context has to be first considered. The purpose of the classification is in fact not to

determine once and for all the type of each verb but to provide a list of members in

order to make it easier to identify the different types and compare any verb with

them.

6.3.1 Extracting Verb Cues from the PDTB

Verbal cues were extracted from the PDTB from files containing the attribution

phrase, i.e. source and cue, of each attribution relation. These files were tagged

with the POS tagger developed by the ‘Center for Sprogteknologi, Københavns

Universitet’ (CST), based on the Brill tagger and available open-source

(http://cst.dk/online/pos_tagger/), in order to obtain a list of tokens from a specific

word class: the verb. Afterwards, with the help of the CST’s lemmatizer

(http://cst.dk/online/lemmatiser/), using Celex training data (© Max Planck Institute

for Psycholinguistics, 2001), a list of just the verb lemmas was obtained and

manually reviewed. The resulting list comprises about 470 verbs and is reported in

Appendix 3.
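
The filtering step described above can be sketched as follows. The sketch does not call the CST tools: it assumes that tagging and lemmatisation have already produced (token, tag, lemma) triples with Penn Treebank style tags, which is an assumption about the input format, and the sample tokens are invented.

```python
from typing import Iterable, List, Tuple

# (token, POS tag, lemma) triples, as a tagger plus lemmatiser might produce.
# Verb tags are assumed to start with 'VB' (Penn Treebank convention).
Tagged = Tuple[str, str, str]


def verb_lemmas(tokens: Iterable[Tagged]) -> List[str]:
    """Collect the distinct lemmas of all verb tokens, sorted alphabetically."""
    return sorted({lemma.lower() for _tok, tag, lemma in tokens if tag.startswith("VB")})


# Invented tokens standing in for the attribution phrases of a few relations.
attribution_phrases = [
    ("analysts", "NNS", "analyst"), ("said", "VBD", "say"),
    ("the", "DT", "the"), ("company", "NN", "company"), ("argues", "VBZ", "argue"),
    ("he", "PRP", "he"), ("says", "VBZ", "say"),
]

print(verb_lemmas(attribution_phrases))  # ['argue', 'say']
```

The resulting list would still need the manual review mentioned above, since tagging and lemmatisation errors end up in it unchanged.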

In order to enlarge the list of Italian reportive verbs, English verb cues from

the PDTB could be usefully employed. By comparing the inventory extracted from

the PDTB with the one already collected for Italian, it would be possible to merge

these two lists.
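
One possible way to compare the two inventories is through the English equivalents recorded for the Italian cues. The sketch below is only an illustration: the PDTB excerpt is invented, the helper `missing_from_italian` is hypothetical, and the English glosses are given without ‘to’ so that they match the lemma form.

```python
from typing import Dict, List, Set

# Excerpt of the Italian inventory with its English equivalents (Appendix 2).
ITALIAN_CUES: Dict[str, str] = {
    "dire": "say",
    "affermare": "assert",
    "annunciare": "announce",
    "credere": "believe",
}

# Invented excerpt of the verb lemmas extracted from the PDTB.
PDTB_VERBS: Set[str] = {"say", "announce", "argue", "believe", "estimate"}


def missing_from_italian(pdtb: Set[str], italian: Dict[str, str]) -> List[str]:
    """English PDTB verbs with no equivalent yet among the Italian cue glosses."""
    covered = set(italian.values())
    return sorted(pdtb - covered)


print(missing_from_italian(PDTB_VERBS, ITALIAN_CUES))  # ['argue', 'estimate']
```

Each verb returned this way would then need an Italian equivalent to be chosen by hand before being added to the list.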

6.4 Summary

The annotation process consists of different steps. First of all it is necessary to

identify in the article, displayed by the tool as plain text, an attribution relation.

Afterwards the text spans corresponding to the components of the attribution need

to be selected. Each selection, or markable, is then assigned a role, i.e. cue,

source, content or supplement. On the source, in the annotation window in the

tool, the attributes have to be specified by selecting the appropriate values. Lastly, the

markables belonging to the same attribution are connected in a relation set.
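
The outcome of these steps can be pictured as a small data structure. The sketch below is not the MMAX2 file format: the class names are hypothetical, the attribute values are those of the scheme in Appendix 1, and the span boundaries chosen for example (164) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

ROLES = ("Cue", "Source", "Content", "Supplement")


@dataclass
class Markable:
    """A selected text span with its role in the attribution."""
    text: str
    role: str

    def __post_init__(self) -> None:
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


@dataclass
class AttributionRelation:
    """The markables of one attribution together with its attribute values."""
    markables: List[Markable] = field(default_factory=list)
    attribution_type: str = "None"
    factuality: str = "Factual"
    scopal_change: str = "None"
    source_type: str = "Writer"

    def spans(self, role: str) -> List[str]:
        """Return the text of all markables having the given role."""
        return [m.text for m in self.markables if m.role == role]


# Example (164), with assumed span boundaries.
relation = AttributionRelation(
    markables=[
        Markable("Le opposizioni", "Source"),
        Markable("non credono", "Cue"),
        Markable("alla rinascita del tripartito", "Content"),
    ],
    attribution_type="Belief",
    factuality="Non-factual",
    scopal_change="Scopal_change",
    source_type="Other",
)
print(relation.spans("Cue"))
```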

The selection of the text span corresponding to each markable is not free

from obstacles. Some indications were provided which should constitute a

guideline for prospective annotators. Source, content and cue have to include the

elements having that role and any modifiers. Additional useful spans can be

included as supplement.

More indications were given concerning the selection of the values for each

attribute. The selection of the ‘type’ feature cannot rely on a previously given

repository of cues classified according to the attitude they express as this depends

also on the context. The values for the feature ‘factuality’ should be selected

bearing in mind the question: is there a perceived attribution? ‘Scopal change’ has

instead to be assigned to factual attributions reversing the polarity of the content,

but also attributions in which the negation is reversing the attitude and not

negating the existence of a link. The ‘source type’ is probably the least problematic



feature. Its values are assigned on the basis of the source being a determined

entity, either the ‘writer’ or ‘other’, or an impersonal or generic one, the latter being

usually employed to express hearsay. ‘Mixed’ is instead used to label the source

type when multiple sources belonging to different types are present.

Lastly, an attempt to list and classify Italian cues was presented (Appendix

2). In order to find more verb cues, the ones in the PDTB have been extracted and

are available for comparison with the Italian list with the aim of adding the missing ones.

The outcome of this work is reported in Appendix 3.

7 Conclusion


This thesis originates from the intention to compensate for the lack of a discourse

level of annotation in the ISST corpus of Italian. In the frame of discourse relations,

attribution was chosen as the current topic not just because of its relevance, but

also in order to provide this phenomenon with a more complete investigation.

Studies involving attribution, the most relevant being the PDTB (Prasad, Miltsakaki

et al., 2008) and the Opinion Corpus (Wiebe, 2002), have till now approached this

matter only partially. This thesis provides instead a full account of the phenomenon

through an independent approach.

The aim was not only to develop a reliable annotation schema to apply to

the ISST, but also to contribute to the progress of IR and QA studies. The outcome

of this study is particularly relevant for example for works dealing with information

and committed to providing software capable of discerning reliable or relevant

sources in order to deliver quality data.

Attribution was analysed in all its linguistic manifestations, comprising

relations at the discourse level as well as intra-sentential ones involving smaller

units such as clauses, phrases and even words. The resulting picture is that of a

very composite phenomenon which, in contrast with the claims of Skadhauge and

Hardt (2005), is only partially syntactically inferable.

On a more tangible side, the present study resulted in the definition of an

annotation schema and the pilot annotation of a portion of the corpus on which its

feasibility was tested. The annotation schema originates from the one adopted in

the PDTB and partially departs from it in order to conform to the present

requirements. An accurate analysis of some available annotation tools also

accompanies the development of the pilot.

In order to facilitate the annotation process and the identification of

attribution a first list of Italian cues, namely the elements functioning as textual

anchor of each relation, was also collected and is included in the thesis.

7.1 Future Work

While this thesis lays the foundations of the attribution annotation project, it surely


does not represent its completion. The proposed annotation schema needs to be

tested and consequently perfected. This could be done with the help of one or

more annotators, who should mark the same portion of the corpus already

annotated for the pilot, in order to verify through the agreement the clarity of the

schema and amend ambiguous tasks or distinctions. If possible, an ad hoc tool

should be developed in order to facilitate the annotation process even more and

provide all the desired features. Finally, the annotation should be performed on the

whole ISST corpus and statistically evaluated.

Some features have also proved to require better investigation. It would be

for example useful to better determine the conditions for a scopal change to

happen and identify other elements potentially able to superficially scope over the

cue or even source span and affecting instead the content. The interaction

between attribution coreference and event anaphora or attribution relations and

other discourse relations should also be further investigated as well as the role of

logical metonymy in some attribution cues.

The project could also be expanded so as to encompass more features, or

finer feature distinctions. The source type value ‘other’ could in fact be further

specified so as to distinguish determined sources which can be identified, e.g.

proper names, holders of important offices, institutions, from sources having instead a

generic or common referent, e.g. a man, experts, etc. Moreover, attribution could

be expanded so as to comprise feelings and emotions as in the Opinion Corpus

(Wiebe, 2002).

The Italian cue inventory should also be enlarged, first by merging it with

the PDTB verb inventory and subsequently by extracting all the cues in the ISST

corpus. This could be done once the annotation is completed as it would be then

possible to easily collect the cues already annotated and marked also for their

type.
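
Once annotated cues are available, collecting them by type could look like the sketch below. The input pairs are invented, the function `collect_cues` is hypothetical, and reducing each cue span to a lemma is assumed to have been done beforehand.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def collect_cues(annotated: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Group annotated cue lemmas by attribution type and count them."""
    inventory: Dict[str, Counter] = defaultdict(Counter)
    for lemma, attribution_type in annotated:
        inventory[attribution_type][lemma] += 1
    return dict(inventory)


# Invented (cue lemma, type) pairs standing in for a completed annotation.
annotated_cues = [
    ("dire", "Assertion"), ("credere", "Belief"),
    ("dire", "Assertion"), ("sostenere", "Assertion"), ("sostenere", "Eventuality"),
]

for attribution_type, counts in collect_cues(annotated_cues).items():
    print(attribution_type, dict(counts))
```

A polysemous verb such as ‘sostenere’ simply appears under every type with which it was annotated, in line with the way Appendix 2 lists such verbs in more than one group.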

7.1.1 And Beyond

Another aspect that deserves further investigation emerged during the annotation.

I realised that after a month or two browsing the corpus or performing the

annotation, and consequently reading the news in it, I was often getting confused. I

was in fact mixing events from the articles in the corpus, which therefore happened

about fifteen years ago, with today’s happenings from the online news. The process

I observed could be regarded as evidence of the fact that we tend to remember

contents, or information, and forget how we acquired them. The temporal flattening

I experienced was also a source flattening: I could remember the information I

read but I was partly no longer able to discern between my sources: the ISST

corpus and the Web.

‘Who said that?’ is the question underlying this thesis; however, while trying to

answer it, other questions arise as it becomes more and more evident that knowing

this answer alone is not sufficient. Knowing in which circumstances the attribution

event was real or took place is also necessary. The same source can think or

assert different things about the same event at different times: just imagine a

politician commenting on an issue before or after his party is elected, or an

expert suggesting a particular investment, and so forth. Sources may also assert or

express the same attitude in different circumstances (e.g. on a formal or humorous

occasion, to audiences differing in expertise or bias, of their own free will or under threat,

etc.), hence affecting the meaning and the way the content is perceived.

Anchoring attribution to the event it refers to, the circumstances (audience,

situation, etc.) in which it took place and the temporal dimension would allow e.g.

retrieving different opinions expressed by different sources about the same event

(e.g. historical happenings, political issues, etc…) together with the evolution of a

source’s thought concerning the same topic during a lapse of time. More

importantly, it would determine a more correct understanding of the content and its

real semantic significance and a consequent increased precision in the selection

of relevant and trustworthy information.

However far the stream flows, it never forgets its source.

(Nigerian Proverb)


Bibliography

Aikhenvald, A. Y., Evidentiality. Oxford: Oxford University Press, 2004.

Bergler, S., “The semantics of collocational patterns for reporting verbs”. In Proceedings of the Fifth Conference of the European Chapter of the Association for Computational Linguistics, Berlin, Germany, 1991.

Carlson, L., Marcu, D., Discourse tagging manual. ISI Tech Report ISI-TR-545,

2001. Available at: http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf

Carlson, L., Marcu, D., Okurowski, M. E., “Building a Discourse-tagged Corpus in

the Framework of Rhetorical Structure Theory”. In van Kuppevelt, J., Smith, R., Current Directions in Discourse and Dialogue, pp. 85-112, 2003.

Caselli, T., Ide, N., Bartolini, R., “A Bilingual Corpus of Inter-linked Events”. In Proceedings of the 6th International Conference on Language Resources Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.

Cristea, D., Webber, B., “Expectations in Incremental Discourse Processing”. In

Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 88-95 Madrid, Spain, 1997.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., “GATE: A Framework

and Graphical Development Environment for Robust NLP Tools and Applications”. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL ’02). Philadelphia, July 2002.

De Haan, F. “Coding of Evidentiality” / “Semantic distinctions of Evidentiality”. In

Haspelmath, M., Dryer, M., Gil, D., Comrie, B. (eds.), The World Atlas of Language Structures Online. Munich: Max Planck Digital Library, chapter 77-78, 2008. Available at: http://wals.info/feature/77 (/78).

Forbes, K., Miltsakaki, E., Prasad, R., Sarkar, A., Joshi, A., Webber, B., “D-LTAG System: Discourse Parsing with a Lexicalized Tree Adjoining Grammar”. Journal of Logic, Language and Information, 12(3), 2003.

Gajewski, J., “Neg-rising Predicates are Definite Plural World Descriptions”. Sinn und Bedeutung, 9, Nijmegen, The Netherlands, 2004.

Giacalone Ramat, A., Topadze, M., “The Coding of Evidentiality: A Comparative Look at Georgian and Italian”. In Rivista di Linguistica Italiana 19,1, special issue on Evidentiality between lexicon and grammar, edited by Mario Squartini, 2007.

Grimes, J., The Thread of Discourse. The Hague: Mouton, 1975.


Grosz, B. J., Sidner, C. L., “Attention, Intention, and the Structure of Discourse”. Computational Linguistics, 12 (3): 175-204, 1986.

Halliday, M. A. K., Hasan, R., Cohesion in English. London: Longman UK group Limited, 1976.

Hobbs, J. R., On the Coherence and Structure of Discourse. Technical Report CSLI-85-37, Center for the Study of Language and Information, Stanford University, 1985.

Hunter, J., Asher, N., Reese, B., Denis, P., “Evidentiality and intensionality: Two

uses of reportative constructions in discourse”. Presented at the 2006 Workshop on Constraints in Discourse, Maynooth, Ireland, July 7-9, 2006.

Kiparsky, C, Kiparsky, P., “Fact”. In Jakobovits, L., Steinberg, D. (eds.), Semantics:

An Interdisciplinary Reader in Philosophy, Linguistics and Psychology, Cambridge: Cambridge University Press, pp.345-369, 1971.

Knott, A., A Data-driven Methodology for Motivating a Set of Coherence Relations.

PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1996.

Lee, A., Joshi, A., “Systematic Mismatches Across Annotations”. ULA Workshop, University of Colorado, Boulder, March, 2008.

Levin, B., English Verb Classes and Alternations: A Preliminary Investigation. Chicago and London: The University of Chicago Press, 1993.

Mann, W. C., Thompson, S. A., “Rhetorical Structure Theory: A theory of text organization”. In Polanyi, L. (ed.) The Structure of Discourse, Ablex, 1988.

Mladová, L., Zikánová, Š., Hajičová, E., “From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank”. In Proceedings of the 6th International Conference on Language Resources Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.

Montemagni, S., Barsotti, F., Battista, M., Calzolari, N., Corazzari, O., Lenci, A.,

Zampolli, A., Fanciulli, F., Massetani, M., Raffaelli, R., Basili, R., Pazienza, M. T., Saracino, D., Zanzotto, F., Mana, N., Pianesi, F., Delmonte, R., “Building the Italian Syntactic-Semantic Treebank”. In Anne Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, pp. 189-210, 2003.

Moore, J. D., Pollack, M. E., “A Problem for RST: The Need for Multi-Level

Discourse Analysis”. Computational Linguistics, 18 (4), 537-544, 1992.


Moore, J. D., Wiemer-Hastings, P., “Discourse in Computational Linguistics and Artificial Intelligence”. In A. C. Graesser, M. A. Gernsbacher, S. R. Goldman (Eds.), Handbook of Discourse Processes, London: Lawrence Erlbaum, pp. 439-485, 2003.

Mueller, C., MMAX2 Annotation Tool - Quickstart Guide and Style Sheet Guide, EML Research gGmbH, 27th – 28th October 2004. Available at: http://mmax2.sourceforge.net/

Mueller, C., MMAXQL The MMAX2 Query Language – Reference Manual (draft),

EML Research gGmbH, 12th August 2004. Available at: http://mmax2.sourceforge.net/

Mueller, C., Strube, M., “MMAX: A Tool for the Annotation of Multi-modal Corpora”.

In Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, pp.45-50, August 2001.

Mueller, C., Strube, M., “Multi-level Annotation of Linguistic Data with MMAX2”. In

Braun, S., Kohn, K., Mukherjee, J. (Eds.), Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods (English Corpus Linguistics, vol.3), Frankfurt: Peter Lang, pp.197-214, 2006.

Murphy, A. C., “Markers of attribution in English and Italian opinion articles: A comparative corpus-based study”. ICAME Journal vol. 29 pp. 131-150, 2005.

Ogren, P. V., “Knowtator: A Protégé Plug-in for Annotated Corpus Construction”. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Morristown, USA, pp.273-275, 2006.

Péry-Woodley, M.P., Scott, D., “Computational Approaches to Discourse and Document Processing”. T.A.L. 47(2): 7-19, 2006.

Polanyi, L., Scha, R. J. H., “On the Recursive Structure of Discourse”. In Ehlich, K., van Riemsdijk, H. (eds.), Connectedness in Sentence, Discourse and Text, Tilburg: Tilburg University, pp. 141-178, 1983.

Poesio, M., Delmonte, R., Bristot, A., Chiran, L., Tonelli, S., The VENEX Corpus of

Anaphora and Deixis in Spoken and Written Italian (draft), 2009. Available at: http://cswww.essex.ac.uk/staff/poesio/

Prasad, R., Dinesh, N., Lee, A., Joshi, A., Webber, B., “Attribution and its

Annotation in the Penn Discourse TreeBank”. In Traitement Automatique des Langues, Special Issue on Computational Approaches to Document and Discourse, vol. 47, no. 2:43-64, 2007.

Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B., “The Penn Discourse TreeBank 2.0”. In Proceedings of the 6th International Conference on Language Resources Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.

Prasad, R., Husain, S., Sharma, D.M., Joshi, A., “Towards an Annotated Corpus of Discourse Relations in Hindi”. In Proceedings of IJCNLP Workshop on Asian Language Resources, Hyderabad, India, 2008.

Prasad, R., Miltsakaki, E., Dinesh, N., Lee, A., Joshi, A., Robaldo, L., Webber, B.

The Penn Discourse Treebank 2.0. Annotation Manual. IRCS Technical Report IRCS-08-01. Institute for Research in Cognitive Science, University of Pennsylvania, 2008.

Prasad, R., Miltsakaki, E., Joshi, A., Webber, B., “Annotation and data mining of the Penn Discourse TreeBank”. In ACL Workshop on Discourse Annotation, Barcelona, Spain, 2004.

Renzi, L., Salvi, G., Cardinaletti, A., Grande Grammatica Italiana di Consultazione. Bologna: Il Mulino, vol. III: 431-436, 1995.

Sag, I. A., Pollard, C., “An Integrated Theory of Complement Control”, Language, vol. 67, n° 1: 63-113, 1991.

Saurí, R., Pustejovsky, J., “From Structure to Interpretation: A Double-layered Annotation for Event Factuality”. In Proceedings of the 6th International Conference on Language Resources Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.

Schmidt, T., “The Transcription System EXMARaLDA: An Application of the

Annotation Graph Formalism as the Basis of a Database of Multilingual Spoken Discourse”. In Bird, S., Buneman, P., Liberman, M. (ed.), Proceedings of the IRCS Workshop on Linguistic Databases 11-13 December 2001, Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania, pp. 219-227, 2001. Available at: http://www1.uni-hamburg.de/exmaralda/files/IRCS_Paper.pdf

Skadhauge, P. R., Hardt, D., “Syntactic Identification of Attribution in the RST

Treebank”. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, Korea, 11-13 October, 2005.

Soria, C., Ferrari, G., “Lexical marking of discourse relations - some experimental

findings”. In Proceedings of the Conference workshop Discourse Relations and Discourse Markers at COLING-ACL'98, pp. 36-42. Montréal, Québec, Canada, 1998.

Talmy, L., Toward a Cognitive Semantics. Cambridge, MA: Massachusetts Institute of Technology, vol. II, 2000.

Webber, B., “Accounting for Discourse Relations: Constituency and Dependency”. In Butt, M., Dalrymple, M. and King, T., Intelligent Linguistic Architectures, Stanford: CSLI Publications, pp. 339-360, 2006.

Webber, B., Joshi, A., Miltsakaki, E., Prasad, R., Dinesh, N., Lee, A., Forbes, K., “A Short Introduction to the Penn Discourse TreeBank”. In Copenhagen Working Papers in Language and Speech Processing, 2005.

Webber, B., Stone, M., Joshi, A., “Anchoring a Lexicalized Tree-Adjoining

Grammar for Discourse”. In Coling/ACL Workshop on Discourse Relations and Discourse Markers, Montreal, Canada, pp. 86-92, 1998.

Webber, B., Stone, M., Joshi, A., Knott, A., “Anaphora and Discourse Structure”. Computational Linguistics 29: 545-587, 2003.

Webber, B., Stone, M., Joshi, A., Knott, A., “Discourse Relations: A Structural and Presuppositional Account Using Lexicalised TAG”. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, pp.41-48, 1999.

Wiebe, J., Instructions for annotating opinions in newspaper articles. Technical

report TR-02-101, Department of Computer Science, University of Pittsburgh, 2002.

Wiebe, J., Wilson, T., Cardie, C., “Annotating Expressions of Opinions and Emotions in Language”. Language Resources and Evaluation 1(2), 2005.

Williams, S., Power, R., “Deriving Rhetorical Complexity Data from the RST-DT Corpus”. In Proceedings of the 6th International Conference on Language Resources Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.

Wilson, T., Wiebe, J., “Annotating Attributions and Private States”. In Proceedings

of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, Ann Arbor, Michigan, 2005.

Wolf, F., Gibson, E., “Representing Discourse Coherence: A Corpus-based Study”. Computational Linguistics 31:249-287, 2005.

Xue, N., “Annotating Discourse Connectives in the Chinese Treebank”. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, Ann Arbor, Michigan, 2005.

Zeyrek, D., Webber, B., “A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus”. Paper presented at The 6th workshop on Asian Language Resources, The 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, 2008.


Abbreviations and Acronyms

AO          Abstract Object
AR          Attribution Relation
Arb         Arbitrary
Arg1        Argument 1
CDTB        Chinese Discourse Treebank
Comm        Communication (attribution type)
cs          Corriere della Sera
Ctrl        Control (attribution type)
D-LTAG      Lexicalised Tree-Adjoining Grammar for Discourse
DPT         Discourse Parse Tree
DS          Discourse Segment
DTD         Document Type Definition
EDU         Elementary Discourse Unit
els         else
EXMARaLDA   EXtensible MARkup Language for Discourse Annotation
Ftv         Factive (attribution type)
GATE        General Architecture for Text Engineering
IO          Indirect Object
IR          Information Retrieval
ISST        Italian Syntactic-Semantic Treebank
ITB         Italian TimeBank
IWN         ItalWordNet
LDM         Linguistic Discourse Model
MTC         METU Turkish Corpus
NL          Natural Language
NP          Noun Phrase
Ot          Other
PAtt        Propositional Attitude (attribution type)
PDT         Prague Dependency Treebank
PDTB        Penn Discourse TreeBank
period      Periodicals
POS         Part Of Speech
PTB         Penn TreeBank
QA          Question Answering
re          Repubblica
RR          Rhetorical Relation
RST         Rhetorical Structure Theory
RST-DT      Rhetorical Structure Theory Discourse Treebank
sole        Il Sole 24 Ore
Sup1        Supplement 1
TAG         Tree Adjoining Grammar
WALS        World Atlas of Language Structures
Wr          Writer
WSJ         Wall Street Journal
XML         EXtensible Markup Language
XSL         EXtensible Stylesheet Language


Appendix 1 – MMAX2 Code

MMAX2 StyleSheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
                xmlns:mmax="org.eml.MMAX2.discourse.MMAX2DiscourseLoader"
                xmlns:Attribution_relation="www.eml.org/NameSpaces/Attribution_relation">
  <xsl:output method="text" indent="no" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="words">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="word">
    <xsl:value-of select="mmax:registerDiscourseElement(@id)"/>
    <xsl:apply-templates select="mmax:getStartedMarkables(@id)" mode="opening"/>
    <xsl:value-of select="mmax:setDiscourseElementStart()"/>
    <xsl:apply-templates/>
    <xsl:value-of select="mmax:setDiscourseElementEnd()"/>
    <xsl:apply-templates select="mmax:getEndedMarkables(@id)" mode="closing"/>
    <xsl:text> </xsl:text>
  </xsl:template>

  <xsl:template match="Attribution_relation:markable" mode="opening">
    <xsl:value-of select="mmax:addLeftMarkableHandle(@mmax_level, @id, '[')"/>
  </xsl:template>

  <xsl:template match="Attribution_relation:markable" mode="closing">
    <xsl:value-of select="mmax:addRightMarkableHandle(@mmax_level, @id, ']')"/>
  </xsl:template>
</xsl:stylesheet>

MMAX2 Scheme

<?xml version="1.0" encoding="UTF-8"?>
<annotationscheme>
  <attribute id="role" name="attribution_role" type="nominal_button">
    <value name="None"/>
    <value name="Cue" next="attribution_type, attribution_factuality, source_type, source_uniID"/>
    <value name="Source"/>
    <value name="Content"/>
    <value name="Supplement"/>
  </attribute>
  <attribute id="attribution_type" name="Type" type="nominal_list">
    <value name="None"/>
    <value name="Assertion" next="attribution_scopal_change"/>
    <value name="Fact"/>
    <value name="Belief" next="attribution_scopal_change"/>
    <value name="Eventuality" next="attribution_scopal_change"/>
  </attribute>
  <attribute id="attribution_factuality" name="Factuality" type="nominal_button">
    <value name="Factual"/>
    <value name="Non-factual"/>
  </attribute>
  <attribute id="attribution_scopal_change" name="Scopal_change" type="nominal_button">
    <value name="None"/>
    <value name="Scopal_change"/>
  </attribute>
  <attribute id="source_type" name="Source" type="nominal_list">
    <value name="Writer"/>
    <value name="Other"/>
    <value name="Arbitrary"/>
    <value name="Mixed"/>
  </attribute>
  <attribute id="source_uniID" name="source_ID" type="freetext">
    <value id="source_ID_uni" name="source_ID"/>
  </attribute>
  <attribute id="Relation" name="Relation" type="markable_set" style="rcurve" color="red">
    <value name="Relation"/>
  </attribute>
</annotationscheme>

MMAX2 Customization

<?xml version="1.0" encoding="UTF-8"?>
<customization>
  <rule pattern="{all}" style="foreground=blue handles=black bold=true "/>
  <rule pattern="attribution_role={cue}" style="background=orange" />
  <rule pattern="attribution_role={content}" style="background=cyan" />
  <rule pattern="attribution_role={source}" style="background=green" />
  <rule pattern="attribution_role={supplement}" style="background=lightGray" />
</customization>


Appendix 2 – Italian Attribution Cues

Verb Cues

Assertion

accusare        to accuse           35    iniziare      to start              35
affermare       to assert           41    insinuare     to insinuate          2
aggiungere      to add              77    invocare      to invoke             8
ammettere       to admit            40    lamentare     to lament/complain    9
annunciare      to announce         72    mormorare     to murmur             3
apostrofare     to address          0     mostrare      to show               49
asserire        to assert           2     narrare       to narrate            2
augurare        to wish             11    negare (-)    to deny               14
avvertire       to warn             20    nominare      to mention            26
avvisare        to warn             4     osservare     to observe            28
bisbigliare     to whisper          0     parlare       to talk               181
borbottare      to mumble           0     proporre      to propose            60
chiacchierare   to chat             4     raccontare    to tell               61
chiarire        to clarify          19    replicare     to reply              11
chiedere        to ask              152   riassumere    to sum up             16
cominciare      to commence         95    ribattere     to talk back          3
commentare      to comment on       31    ricominciare  to start over again   6
comunicare      to communicate      15    riconoscere   to acknowledge        29
concludere      to conclude         56    riferire      to relate             42
condividere     to share            11    rimproverare  to reproach           3
confermare      to confirm          80    ripetere      to repeat             36
continuare      to continue         107   riportare     to report             38
controbattere   to talk back        0     riportare     to account            38
declamare       to rave             1     riprendere    to resume             41
denunciare      to denounce         30    rispondere    to answer             83
dichiarare      to declare          69    rivelare      to reveal             34
dire            to say              532   sbottare      to burst out          0
domandare       to ask              5     scrivere      to write              112
elogiare        to praise           0     seguitare     to continue           1
esclamare       to exclaim          0     soggiungere   to add                0
esprimere       to express          39    sostenere     to claim              75
fare            to do/say           N     spiegare      to explain            115
gridare         to shout            15    testimoniare  to testify            9
informare       to inform           11    urlare        to shout              8

Belief

credere         to believe          81
dubitare        to doubt            2
immaginare      to imagine          18
pensare         to think            134
ponderare       to ponder           0
riflettere      to think            18
supporre        to assume           2

Fact

Cue                   Gloss                 Count
apprendere            to learn              3
capire                to understand         81
constatare            to ascertain          4
dimenticare           to forget             25
dimostrare            to prove              43
essere a conoscenza   to know               N
evidenziare           to point out          28
leggere               to read               48
notare                to note               15
osservare             to observe            28
rendersi conto        to realise            N
ricordare             to remember           105
rilevare              to point out          34
rimpiangere           to regret             2
sapere                to know               182
sentire               to hear/feel          89
udire                 to hear               3
vedere                to see                284
venire a conoscenza   to get to know        N
venire in mente       to remember           N

Eventuality

Cue                   Gloss                 Count
accettare             to accept             48
acconsentire          to agree              0
accordarsi            to arrange            0
appoggiare            to support            8
aspettarsi            to expect             44
assicurare            to assure             33
augurarsi             to wish oneself       N
bramare               to long for           0
comandare             to command            1
concordare            to agree/arrange      16
condividere           to share              11
consentire            to allow              81
consigliare           to advise             23
convenire             to agree              8
declinare             to refuse             1
desiderare            to wish               6
discordare            to disagree           0
essere d’accordo      to agree              N
implorare             to beg                1
imporre               to impose             42
intendere             to intend             49
invocare              to invoke             8
lasciare              to let                121
minacciare            to threaten           24
ordinare              to order              9
permettere            to allow              46
persuadere            to persuade           1
pregare               to pray               1
promettere            to promise            27
provare               to prove              27
raccomandare          to recommend          1
rassicurare           to reassure           10
rifiutare             to disagree           24
riportare             to account            38
sospettare            to suspect            7
sostenere             to support            75
sperare               to hope               47
suggerire             to suggest            22
supplicare            to plead              0
temere                to fear               30
volere                to want               626
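The four verb-cue classes overlap: some lemmas are listed under more than one class, for example osservare under both Assertion and Fact, and sostenere under both Assertion and Eventuality. The Python sketch below shows one way such a cue lexicon could be stored so that these ambiguous lemmas are easy to find. It uses only a handful of rows copied from the lists in this appendix. The names `VERB_CUES`, `build_lexicon` and `ambiguous_lemmas` are hypothetical and not part of the annotation schema.

```python
from collections import defaultdict

# A few (lemma, cue class, figure listed beside the cue) rows copied from
# the verb-cue lists in this appendix. This is an illustrative subset only.
VERB_CUES = [
    ("dire", "Assertion", 532),
    ("osservare", "Assertion", 28),
    ("sostenere", "Assertion", 75),
    ("pensare", "Belief", 134),
    ("osservare", "Fact", 28),
    ("sapere", "Fact", 182),
    ("sostenere", "Eventuality", 75),
    ("volere", "Eventuality", 626),
]


def build_lexicon(rows):
    """Map each lemma to the list of cue classes it is listed under."""
    lexicon = defaultdict(list)
    for lemma, cue_class, _count in rows:
        lexicon[lemma].append(cue_class)
    return dict(lexicon)


def ambiguous_lemmas(lexicon):
    """Return, in alphabetical order, the lemmas listed under several classes."""
    return sorted(lemma for lemma, classes in lexicon.items() if len(classes) > 1)


if __name__ == "__main__":
    lexicon = build_lexicon(VERB_CUES)
    print(ambiguous_lemmas(lexicon))
```

A lemma returned by `ambiguous_lemmas` cannot be assigned a cue class from the lexicon alone, so its class would have to be decided in context.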

Other Cues

Noun Markers

Cue                   Gloss                 Count
acclamazione          applause              1
accordo               agreement             109
affermazione          assertion             10
ammirazione           admiration            2
ammissione            admission             2
annuncio              announcement          13
appello               appeal                20
appoggio              support               4
apprezzamento         appreciation          4
approvazione          approval              20
aspettativa           expectation           5
augurio               wish                  5
avvertimento          warning               3
certezza              certainty             15
chiarimento           clarification         6
comando               command               14
commento              comment               41
comunicato            release               19
congettura            conjecture            0
conoscenza            knowledge             15
consenso              consensus             26
consiglio             advice                136
constatazione         realization           3
credenza              belief                1
deposizione           deposition            2
desiderio             desire                4
dichiarazione         declaration           59
dimostrazione         demonstration         8
discordanza           disagreement          1
disprezzo             contempt              1
domanda               question              42
dubbio                doubt                 27
esclamazione          exclamation           0
grido                 shout                 8
idea                  idea                  55
illazione             insinuation           0
implorazione          imploration           0
imposizione           imposition            3
informazione          information           89
insinuazione          insinuation           2
intesa                agreement             39
invocazione           invocation            1
lamentela             complaint             2
lode                  praise                0
memoria               memory                19
mormorio              murmuring             0
narrazione            narration             0
nota                  note                  71
opinione              opinion               21
ordine                order                 93
osservazione          observation           12
parola                word                  76
patto                 pact                  19
paura                 fear                  31
pensiero              thought               11
permesso              permission            4
persuasione           persuasion            0
petizione             petition              0
plauso                approval              0
posizione             position              71
preghiera             prayer                1
promessa              promise               9
proposta              proposal              73
punto di vista        point of view         N
raccomandazione       recommendation        2
racconto              story                 12
rassicurazione        reassurance           0
replica               reply                 4
resoconto             account               2
ricordo               memory                9
rifiuto               refusal               6
riflessione           reflexion             13
rilevazione           survey                5
rimpianto             regret                0
risposta              answer                43
rivelazione           revelation            8
segnalazione          signalling            9
sensazione            feeling               12
sostegno              support               15
speranza              hope                  23
suggerimento          suggestion            4
supplica              plea                  0
supporto              support               12
supposizione          supposition           0
testimonianza         testimony             9
timore                fear                  13
urlo                  shout                 4
visione               point of view         5
visione               vision                N
volontà               will                  25

Prepositions

secondo (348)         according to
per (3236)            for

Prepositional groups

per quanto riguarda   as far as it concerns
a detta di            according to
agli occhi di         in the eyes of
nell’ottica di        in the perspective of
a parere di           in the opinion of
stando a              according to

Punctuation

<<...>> (824)
“...” (1783)

Mode

quotative conditional

Appendix 3 – PDTB Verb Cues

accept, accompany, accord, accuse, acknowledge, acquire, act, add, address, admit, adopt, advance, advertise, advise, affect, agree, allege, allow, amend, analyse, anger, announce, ansie, anticipate, appear, appoint, appreciate, argue, arise, ask, assert, assist, associate, assume, assure, attach, attend, attest, attribute, auction, avert, avoid
base, be, bear, become, begin, believe, bet, bid, blame, boast, bolster, break, broadcast, brook, build, bury, buy
calculate, call, capture, caution, challenge, characterize, charge, chastise, check, cheque, chuckle, circulate, cite, claim, classify, clear, close, come, comment, compare, compile, complain, concede, concern, conclude, concur, conduct, confess, confide, confirm, consider, consult, contain, contend, contest, continue, control, convince, copy, count, counter, cover, create, credit, criticize, cut
date, dawn, decide, declare, decline, deem, defend, define, demand, deny, describe, design, determine, develop, disappoint, disclose, discover, dismay, dispute, diversify, doubt, draft, draw, dream, drill, drip, drop, dub
earn, eat, echo, elaborate, emphasize, empty, emulate, encourage, end, erupt, establish, estimate, evaluate, evince, examine, exclaim, exclude, exist, expect, expire, explain, express
favour, fear, feed, feel, figure, file, finance, find, finger, fireproof, float, flock, fly, follow, forecast, foresee, form, found, franchise, free, fret, future
gain, get, give, go, gripe, grow, gush
hamper, hand, handle, head, hear, help, highlight, hint, hire, hit, hold, hope
identify, ignore, illustrate, imply, impose, include, increase, indicate, indict, inform, insinuate, insist, inspire, integrate, intend, interject, interpret, interview, introduce, invent, invest, investigate, involve, issue
join, joke, jump
keep, kill, know
lament, laud, laugh, launch, lay, lead, leak, lean, learn, lease, leave, lecture, left, legislate, light, like, link, liquidate, list, lit, locate, lose, love
maintain, make, manage, mark, market, mean, meet, mention, monitor, motor, muse
name, negotiate, nickname, note, notice, notify
observe, offer, operate, oppose, order, organize, originate, oversee, own
participate, pay, perishables, permit, persuade, pick, place, plan, plant, play, pledge, plot, point, poll, ponder, pour, practise, praise, predict, prepare, present, prime, proclaim, produce, profess, programme, project, promise, promote, prompt, propose, prosecute, protect, prove, provide, publish, pull, purchase, push, put
question, quip, quote
raise, rally, rave, reach, read, realize, reason, rebuild, recall, receive, reckon, recognize, recommend, reconstruct, record, recount, recruit, reduce, refer, reflect, regard, reign, reiterate, reject, relate, release, relieve, remark, remember, remind, repeat, reply, represent, request, require, research, resort, respond, restore, retail, retire, reveal, rid, rob, rule, run
say, scare, scoff, score, scotch, scream, see, seek, seem, sell, send, serve, set, shake, shout, shove, show, shrug, sidestep, sigh, sign, signal, slide, snap, sniff, snort, solicit, specialize, specify, speculate, spell, spend, spill, sponsor, spot, spread, spur, stand, start, state, stem, stop, strengthen, stress, strike, stroke, study, stun, suggest, suit, summarize, supply, support, survey, swap, swear, swivel
take, talk, teach, tell, tend, test, testify, theorize, think, threaten, title, tout, track, trade, trail, travel, trouble, trundle, try
understand, unleash, urge, use
vacate, value, verify, view, visit, voice, volunteer, vote, vow
walk, want, warn, watch, waver, wear, welcome, win, wonder, work, worry, write
yell