Named Entity Recognition for Romanian

Adrian Iftene1, Diana Trandabăț1, Mihai Toader2, Marius Corîci2

KEPT Conference, 4-6 July, Cluj-Napoca, Romania, Babeș-Bolyai University

1 Faculty of Computer Science, “Al. I. Cuza” University of Iași, Romania

2 Intelligentics, Cluj-Napoca, Romania

The problem that we address

Proposed Solution

Named Entity Recognition

◦ Named Entity Identification

◦ Named Entity Classification

◦ Evaluation (Upper Bound, Real Context)

Examples

Conclusions

We want to find out users’ opinions on various products, events or persons:

◦ I want to buy a certain product (e.g. iPhone). What are its strengths and weaknesses? What are the opinions of persons who have used it already?

◦ I am the manager of a big company. I am interested in what people say about my company. Which products have good or bad impact? What policy to adopt next?

◦ I am a candidate in the next elections. What should I change in my political discourse? Why do some voters not like me?



NER - the task of finding textual expressions such as the names of persons, organizations, locations, places, etc.

Existing work:

◦ Statistical models (Nadeau and Sekine, 2007) - require a large amount of manually annotated training data

◦ Machine learning techniques (Scurtu et al., 2009), (Nadeau, 2007) - require large amounts of training data

◦ For Romanian – (Cucerzan and Yarowsky, 1999), (Ion, 2007) and (Machison, 2009) - NER gazetteer for Romanian included in GATE (Cunningham et al., 2002)


We consider the following categories: Person, Organization, Company, Region, Place, City, Country, Product, Brand, Model, and Publication

Named Entity Identification – based on segmentation, tokenizer and lemmatizer components

Named Entity Classification – based on lists, rules and trigger words


Every token that starts with a capital letter is considered a candidate named entity

When a candidate is the first token in a phrase:

1. If it is in our stop word list, we eliminate it from the candidates to be named entities;

2. If it is in our common word list, we consider two cases:

a) the common word is followed by lowercase words (we check them against a list of trigger words). Examples: Universitatea din Cluj-Napoca (En: University of Cluj-Napoca), Țara de Jos (En: Low Country)

b) the common word is followed by uppercase words. Example: Doctor Stomatolog (En: Doctor Dentist)
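The identification step above can be sketched roughly as follows. This is only an illustration: the word lists are tiny hypothetical stand-ins for the system's resources, the function name `find_candidates` is invented here, and we assume that a sentence-initial common word is kept as a candidate only in cases a) and b).

```python
# Hypothetical, tiny resource lists; the real lists are not given in the slides.
STOP_WORDS = {"în", "la", "pe", "și"}
COMMON_WORDS = {"universitatea", "doctor"}
TRIGGER_WORDS = {"din", "de"}  # lowercase words allowed after a common word


def find_candidates(tokens):
    """Return the indices of tokens kept as named entity candidates."""
    candidates = []
    for i, token in enumerate(tokens):
        if not token[:1].isupper():
            continue  # only capitalized tokens can be candidates
        if i == 0:  # the first token in a phrase needs extra checks
            lower = token.lower()
            if lower in STOP_WORDS:
                continue  # rule 1: sentence-initial stop word
            if lower in COMMON_WORDS:
                nxt = tokens[1] if len(tokens) > 1 else ""
                followed_by_trigger = nxt in TRIGGER_WORDS   # case a)
                followed_by_upper = nxt[:1].isupper()        # case b)
                if not (followed_by_trigger or followed_by_upper):
                    continue  # assumed: plain common word, not an entity
        candidates.append(i)
    return candidates


print(find_candidates(["Universitatea", "din", "Cluj-Napoca"]))  # [0, 2]
```

The lowercase token "din" is not a candidate at this stage; joining it with its neighbours is left to the unification rules.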


On the identified candidates we apply rules that unify adjacent candidates, in order to obtain compound named entity candidates:

1) Rules related to person title – Doctorul Popescu (En: Doctor Popescu), Președintele Băsescu (En: President Băsescu)

2) Rules related to organization type – Universitatea Cuza (En: Cuza University)

3) Rules related to abbreviation words – S.C. Travis

4) Rules related to special punctuation signs – Ana-Maria

5) Rules related to named entity candidates separated by stop words - BCR Banca pentru Locuințe (En: BCR Housing Bank), Direcția pentru Sănătate (En: Department of Health)

6) Rules for a specific model/product – Qosmio X500-Q930, Portege R835
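A minimal sketch of the unification idea, covering only plain adjacency (as in rules 1 and 2) and candidates separated by a stop word (rule 5). The stop word list and the function name `unify` are hypothetical; the actual rule set of the system is richer than this.

```python
# Hypothetical list of stop words allowed inside a compound named entity.
LINKING_STOP_WORDS = {"pentru", "de", "din"}


def unify(tokens, is_candidate):
    """Merge adjacent candidates, optionally bridged by one stop word.

    Returns a list of (start, end) index pairs, with end exclusive.
    """
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        if not is_candidate[i]:
            i += 1
            continue
        start = i
        i += 1
        while i < n:
            if is_candidate[i]:
                i += 1  # adjacent candidate: extend the span
            elif (tokens[i].lower() in LINKING_STOP_WORDS
                  and i + 1 < n and is_candidate[i + 1]):
                i += 2  # stop word followed by a candidate: bridge it
            else:
                break
        spans.append((start, i))
    return spans


tokens = ["BCR", "Banca", "pentru", "Locuințe", "a", "anunțat"]
flags = [True, True, False, True, False, False]
for start, end in unify(tokens, flags):
    print(" ".join(tokens[start:end]))  # BCR Banca pentru Locuințe
```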


We consider the following major categories: City, Organization, Company, Country, Person, and additionally we consider categories like Brand, Product and Publication

For almost all major categories we consider subcategories:

◦ For Cities we consider Romanian, European, American and Other Cities

◦ For Organizations we consider Parties, Faculties, Universities, Ministries, etc.

◦ For Persons we consider Sportsmen, Politicians, Males, Females, etc.

A total of 14 main categories with 98 sub-categories



Classification Rules:

1. Rules used in the unification of NE candidates

2. Pure resource-based rules – for the Title type

3. Contextual rules - we consider a mix between regular expressions and available entities from our files - for the Organization, Company, Person, City and Country types

a) For example, oraș, capitală, târg, localitate (En: city, capital, town, locality) are triggers for the City type

b) companie, corporație (En: company, corporation) are triggers for the Company type

c) partid, bancă, universitate (En: party, bank, university) are triggers for the Organization type

d) All titles identified during named entity identification are triggers for the Person type
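The contextual rules can be illustrated with a small sketch. The trigger words are the ones listed above; the gazetteer entry, the function name `classify`, the lookup order (resources first, then triggers) and the assumption that the preceding word is already lemmatized are all ours, not details confirmed by the slides.

```python
# Trigger lemmas taken from the rules above.
TRIGGERS = {
    "City": {"oraș", "capitală", "târg", "localitate"},
    "Company": {"companie", "corporație"},
    "Organization": {"partid", "bancă", "universitate"},
}

# Hypothetical one-entry gazetteer standing in for the system's resource files.
KNOWN_ENTITIES = {"Cluj-Napoca": "City"}


def classify(entity, previous_lemma):
    """Return the NE type for an entity, or None if no rule applies."""
    if entity in KNOWN_ENTITIES:  # resource-based lookup (assumed to go first)
        return KNOWN_ENTITIES[entity]
    prev = previous_lemma.lower()
    for ne_type, words in TRIGGERS.items():  # contextual trigger rules
        if prev in words:
            return ne_type
    return None


print(classify("PNL", "partid"))  # Organization
print(classify("PNL", "la"))      # None
```

This mirrors the PNL case reported in the error analysis: an entity missing from the resources is typed only when a trigger word precedes it.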


Gold: 48 files with 24,244 words and with 1,638 NEs


Agreement between annotators – “PDL Cluj-Napoca” (organization) or “PDL” (organization) “Cluj-Napoca” (city)?

When the first word of a sentence is a common word - Ana (Romanian female name) or ana (rope used on boats)?

Special characters at the beginning of a row – segmentation problems


The percentage of the matched and partially matched entities that have been properly categorized is 95.71%

The main problems in NE classification come from NEs that appear in more than one list of NEs


38 files with a total of 19,509 words and with 1,215 NEs

Upper bound 95.12% (P), 96.40% (R), 95.76% (F), 1.81% (PP), 1.83% (PR)
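As a quick consistency check, and assuming that F denotes the balanced F-measure (the harmonic mean of precision and recall), the reported upper-bound value can be recomputed from P and R:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Upper-bound precision and recall, in percent, as reported above.
print(round(f_measure(95.12, 96.40), 2))  # 95.76
```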


The problems from upper bound evaluation remain the same

Additionally, new problems appear related to the extraction of entities of type Title (which are written in lowercase)

The problems related to Title represent 3.70%

Error cases are represented by the following words: “călugăriță, soră, colonel, viceprimar, co-președinți” (En: nun, nurse, colonel, vice mayor, co-presidents)


The percentage of the matched and partially matched entities that have been properly categorized is 66.73%

The error distribution across the named entity types


For the Company, Organization and Person types, the NEs were not found in our resources and the contextual rules could not be applied

The Publication and Product types are frequently marked interchangeably.

For the Region type, the major cause of errors is that the respective NE also exists in the resources for another type, such as City, Place or Country

An interesting example is the case of PNL (which does not exist in our resources) - when it is preceded by the word partid (En: party), it is correctly classified as Organization, but in all other cases the system does not identify any type for it




In this paper we present a system based on rules and on a list of resources, used for the identification and classification of Romanian named entities

The system is able to distinguish between 14 main NE types

Future work will focus on eliminating the problems caused by common words at the beginning of sentences

Another future direction is related to anaphora, in order to transfer the type of one classified entity to all the expressions that refer to it


The research presented in this paper was funded by the Sectoral Operational Program for Human Resources Development through the project “Development of the innovation capacity and increasing of the research impact through post-doctoral programs” POSDRU/89/1.5/S/49944

The authors of this paper thank their colleagues Alexandru Ginsca, Emanuela Boros, Augusto Perez and Dan Cristea from the Faculty of Computer Science, Iași

David Nadeau and Satoshi Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (2007), no. 1, 3-26, Publisher: John Benjamins Publishing Company.

Silviu Cucerzan and David Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, 1999, pp. 90-99.

H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, GATE: A framework and graphical development environment for robust NLP tools and applications, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.

Radu Ion, Word sense disambiguation methods applied to English and Romanian, PhD Thesis, 2007.

Lucian Mihai Machison, Named entity recognition for Romanian (roner), Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques, KEPT2009, 2009, pp. 53-56.

V. Scurtu, E. Stepanov, and Y. Mehdad, Italian named entity recognizer participation in NER task @ EVALITA 09, 2009.

David Nadeau, Semi-supervised named entity recognition: Learning to recognize 100 entity types with little supervision, PhD Thesis, 2007.
