Proper Nouns in Czech Corpora

Magda Ševčíková
sevcikova@ufal.mff.cuni.cz

Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague
Czech Republic

sevcikova@ufal.mff.cuni.cz Linguistics 2007, July 30

Outline

Introduction
Proper nouns in corpora of Czech: current state
  • Corpus SYN2000
  • Prague Dependency Treebank 2.0
Proposal of a complex proper noun annotation within the Prague Dependency Treebank 2.0
Final remarks

Introduction

proper nouns
  • lacking a generic meaning
  • denoting individuals, institutions etc.
  • identifying them as unique items
proper nouns in NLP
  • question answering
  • information extraction
  • machine translation
    • pan Zelený should not be translated into Mr Green
    • Frankfurt am Main or Frankfurt nad Mohanem, but not a combination of both (e.g., Frankfurt nad Main)
explicit annotation of proper nouns needed

Proper nouns in corpora of Czech: current state

two large corpora of Czech as sources of proper nouns:
SYN2000
  • 100 million tokens
  • morphological annotation
    • morphological lemmas and positional tags
  • no explicit annotation of proper nouns
Prague Dependency Treebank 2.0 (PDT 2.0)
  • morphologically and syntactically annotated
  • very basic annotation of proper nouns
    • at the morphological layer
    • at the deep-syntactic (tectogrammatical) layer

Proper nouns in SYN2000
http://ucnk.ff.cuni.cz

proper nouns were not marked
other characteristics used for searching for proper nouns
  • capitalization
    • only proper nouns capitalized in Czech (in comparison, e.g., to German)
    • however, it is not a sufficiently distinctive feature (sentence beginnings)
  • context patterns
    • for instance, Mr Xxx / President Xxx

| Searching SYN2000 for | Query | Number of occurrences in SYN2000 | Precision in 500 randomly selected occurrences (in %) |
|---|---|---|---|
| names/surnames (or their parts) | lemmas pan/paní/slečna (Mr/Mrs/Miss) followed by a capitalized token | 41,574 | 99.4 |
| names/surnames (or their parts) | (un)capitalized short versions of Czech academic titles doc./dr./ing./JUDr./MUDr./prof./RNDr. followed by a capitalized token | 26,394 | 96.0 |
| names/surnames (or their parts) | lemmas of academic titles doktor/profesor/docent/inženýr (doctor/professor/docent/engineer) followed by a capitalized token | 9,123 | 94.6 |
| town names (or their parts) | digit combination corresponding to Czech zip code format followed by a capitalized token | 7,954 | 92.6 |
| street/square names (or their parts) | lemmas ulice/náměstí (street/square) followed by a capitalized token | 6,233 | 87.6 |
| street names (or their parts) | (un)capitalized abbreviation ul. (for street) followed by a capitalized token | 696 | 87.4 |
| company names (or their parts) | abbreviation s.r.o. (for Ltd.) preceded by a capitalized token (and optionally by a comma) | 2,554 | 100.0 |
| company names (or their parts) | abbreviation a.s. (for PLC) preceded by a capitalized token (and optionally by a comma) | 4,274 | 99.6 |
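
The first context-pattern query above (pan/paní/slečna followed by a capitalized token) can be approximated on raw text with a regular expression. This is only a rough sketch: the corpus query matched lemmas, whereas the pattern below matches surface forms, and the names `CZECH_UPPER`, `TITLE_THEN_CAPITALIZED` and `find_name_candidates` are illustrative, not part of any corpus tool.

```python
import re

# Uppercase letters of the Czech alphabet, used to test capitalization.
CZECH_UPPER = "A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"

# Assumption: only the base forms pan/paní/slečna are matched here.
# The SYN2000 query worked on lemmas, so it also covered inflected
# forms such as "pana" or "slečnu"; this surface-form sketch does not.
TITLE_THEN_CAPITALIZED = re.compile(
    rf"\b(?:[Pp]aní|[Pp]an|[Ss]lečna)\s+([{CZECH_UPPER}]\w*)"
)

def find_name_candidates(text):
    """Return capitalized tokens that directly follow one of the titles."""
    return TITLE_THEN_CAPITALIZED.findall(text)

sample = "Včera přišel pan Zelený a paní Nováková, ale pan ředitel ne."
candidates = find_name_candidates(sample)
print(candidates)
```

The lowercase token in "pan ředitel" is skipped, which is the kind of filtering that yields the high precision reported for this pattern.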

Proper nouns in PDT 2.0
http://ufal.mff.cuni.cz/pdt2.0/

basic annotation of proper nouns
at the morphological layer
  • each token was assigned a morphological lemma and a positional tag
  • lemma flag for marking of proper nouns
at the tectogrammatical layer
  • each sentence represented by a labeled dependency tree structure (consisting of nodes and edges)
  • special means for annotation of selected phenomena concerning proper nouns

PDT 2.0: Morphological layer

proper noun type indicated by a value of a special flag which was attached to lemmas of proper nouns by a separator
  • Jan_;Y, Zelený_;S
seven flag values
  • first names, surnames, inhabitant names, geographical names, institution names, product names, other names
convenient for annotation of one-word proper nouns
insufficient for more complex proper nouns
  • misinterpretations:
    • Frankfurt_;G nad Mohanem_;G
    • Vysoký_;K škola ekonomická (University of Economics)
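
A minimal sketch of reading such a flag off a lemma, assuming lemmas have exactly the simple form shown above (base lemma, the separator `_;`, a one-letter flag). The helper `split_lemma` is hypothetical, and the letter-to-category pairing in `FLAG_MEANINGS` is inferred from the four examples on this slide; the remaining flag letters are not given here and are left out.

```python
FLAG_SEPARATOR = "_;"

# Inferred from the examples on the slide (Jan_;Y, Zelený_;S,
# Frankfurt_;G, Vysoký_;K); the other three flag letters are unknown here.
FLAG_MEANINGS = {
    "Y": "first name",
    "S": "surname",
    "G": "geographical name",
    "K": "institution name",
}

def split_lemma(lemma):
    """Split 'Jan_;Y' into ('Jan', 'Y'); return (lemma, None) if no flag."""
    base, separator, flag = lemma.partition(FLAG_SEPARATOR)
    if not separator or not flag:
        return lemma, None
    return base, flag[0]

# Token-by-token flags cannot express that the three tokens form one name.
tokens = ["Vysoký_;K", "škola", "ekonomická"]
flags = [split_lemma(token)[1] for token in tokens]
print(flags)
```

Only the first token of Vysoká škola ekonomická carries a flag, which illustrates why the flag is insufficient for multi-word names.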

PDT 2.0: Tectogrammatical layer

no complex annotation of proper nouns
annotation means for selected phenomena only
  • person names
    • node attribute is_name_of_person
  • non-inflected street names, book titles etc. accompanied by a generic noun
    • functor ID
  • book titles etc. which have a form of a prepositional group and are not accompanied by a generic noun
    • an ‘artificial’ node with lemma #Idph
besides these individual cases, proper nouns were treated as common parts of a sentence

(a) person name Klára Nováková Malá

(b) V sobotu v poledne je hezký film (lit.: ‘On Saturday at Noon’ is a nice film)

(c) Šli jsme ulicí Spálená (We walked through the street.instr Spálená.nom)

(d) Šli jsme ulicí Spálenou (We walked through the street.instr Spálená.instr)

(e) Šli jsme Spálenou (We walked through Spálená.instr)

(instr for instrumental case, nom for nominative case)

[Figures (a) to (e): diagrams for the examples above]

Proposal of a complex proper noun annotation within PDT 2.0

proper noun type defined at each proper noun
  • proper noun classification
annotation of one-word proper nouns as well as more complicated proper noun structures
  • four structure types to be annotated
the inner structure of more complex proper nouns described as a non-dependency relation
tectogrammatical layer

Proper noun classification for Czech

two-level classification
  • 1st level: five super-types of proper nouns
    • personal names, geographical names, institution names, artefact names, media names
    • (+ two more types: temporal expressions, numerical expressions occurring in postal addresses)
  • 2nd level: proper noun types
    • e.g., types of geographical names: street/square names, city/town names, state names etc.
    • underspecification allowed
    • each type encoded by a unique two-character tag
      • gs for street/square names, gu for city/town names
      • g_ for a geographical name of an unknown type
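
The two-character tags can be read as super-type letter plus type letter, with `_` marking an underspecified type. A small sketch of decoding them follows, covering only the three tags named on this slide; the function `describe_tag` and both lookup tables are illustrative assumptions, and tags for the other super-types are not listed because the slide does not give them.

```python
# Only the tags that appear on the slide; everything else is unknown here.
SUPER_TYPES = {"g": "geographical name"}
TYPES = {
    "gs": "street/square name",
    "gu": "city/town name",
}

def describe_tag(tag):
    """Return (super_type, type); type is None for an underspecified tag."""
    if len(tag) != 2:
        raise ValueError(f"expected a two-character tag, got {tag!r}")
    super_type = SUPER_TYPES[tag[0]]   # KeyError for unlisted super-types
    if tag[1] == "_":
        return super_type, None        # e.g. g_: type left unspecified
    return super_type, TYPES[tag]      # KeyError for unlisted types

print(describe_tag("gs"))
print(describe_tag("g_"))
```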

Structure types to be annotated

(i) one-word proper nouns
  • John

(ii) multi-word proper noun expressions
  • Vysoká škola ekonomická (University of Economics)

(iii) complex proper noun expressions
  • Frankfurt nad Mohanem

(iv) containers
  • Jan Zelený

(i) Annotation of one-word proper nouns

proper noun type indicated at each proper noun

new node attribute: NE_roles
  • value set corresponds to all proper noun type tags (and container tags)
  • substitutes the current is_name_of_person attribute

(ii) Annotation of multi-word proper noun expressions

every constituent of a multi-word proper noun expression has a node of its own

at all nodes, the same value of the NE_roles attribute occurs

edges in the sub-tree labeled with a new functor NEPART

syntactic function of the whole expression indicated by the functor of the governing node

Vyučuje na Vysoké škole ekonomické (He teaches at the University of Economics)

(iii) Annotation of complex proper noun expressions

every constituent has a node of its own

a main part (Frankfurt) and an embedded part (Mohan)

type of the embedded part indicated by the value of the NE_roles attribute at the embedded part, type of the whole expression at the main part

relation between the main and the embedded part labeled with the NEPART functor

Navštívil Frankfurt nad Mohanem (He visited Frankfurt am Main)
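
The proposed annotation of this example can be pictured as a small tree. The sketch below is illustrative only: the `Node` class and `embedded_parts` helper are hypothetical and do not reflect the PDT 2.0 data format, the functors PRED and PAT for the verb and its object are assumptions about the rest of the sentence, and the embedded part gets the underspecified tag g_ because the tag for its specific type is not given in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    lemma: str
    functor: str
    ne_roles: Optional[str] = None          # proposed NE_roles attribute
    children: List["Node"] = field(default_factory=list)

def embedded_parts(main):
    """Children attached to the main part by the proposed NEPART functor."""
    return [child for child in main.children if child.functor == "NEPART"]

# Embedded part: geographical name of an unspecified type (assumption: g_).
mohan = Node("Mohan", functor="NEPART", ne_roles="g_")
# Main part: carries the type of the whole expression (gu, city/town name).
frankfurt = Node("Frankfurt", functor="PAT", ne_roles="gu", children=[mohan])
# Governing verb; functors PRED and PAT are assumed for illustration.
sentence_root = Node("navštívit", functor="PRED", children=[frankfurt])

print(frankfurt.ne_roles, [part.lemma for part in embedded_parts(frankfurt)])
```

The syntactic function of the whole name stays on the main part (here PAT), while NEPART marks only the name-internal relation.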

(iv) Annotation of containers

the #Idph node as the governing node of the whole container

container type indicated by the value of the NE_roles attribute at the #Idph node

proper noun types of the constituents defined by the values of their own NE_roles attributes

relations between the #Idph node and constituents labeled with the NEPART functor

Novým ředitelem je Jan Zelený (Jan Zelený is the new director)

Final remarks

annotation of proper nouns in corpora
  • linguistic research
  • NLP subtasks
complex proper noun annotation within PDT 2.0
  • tectogrammatical layer more convenient than the morphological one
  • annotation means and rules proposed
future work
  • further elaborate the proposed means and rules
  • manual annotation of sample data
  • development of automatic annotation tools