Adrian Iftene1, Diana Trandabăț1, Mihai Toader2, Marius Corîci2
KEPT Conference, 4-6 July, Cluj-Napoca, Romania
Babes-Bolyai University
1 “Al. I. Cuza” University of Iași, Romania, Faculty of Computer Science
2 Intelligentics, Cluj-Napoca, Romania
The problem that we address
Proposed Solution
Named Entity Recognition
◦ Named Entity Identification
◦ Named Entity Classification
◦ Evaluation (Upper Bound, Real Context)
Examples
Conclusions
We want to find out users’ opinions on various products, events or persons:
◦ I want to buy a certain product (e.g. iPhone). What are its strengths and weaknesses? What are the opinions of people who have already used it?
◦ I am the manager of a big company. I am interested in what people say about my company. Which products have a good or bad impact? What policy should we adopt next?
◦ I am a candidate in the next elections. What should I change in my political discourse? Why do some voters not like me?
LiSS Conference, 3-5 May, Iasi
NER - the task of finding textual expressions such as the names of persons, organizations, locations, places, etc.
Existing work:
◦ Statistical models (Nadeau and Sekine, 2007) - require a large amount of manually annotated training data
◦ Machine learning techniques (Scurtu et al., 2009), (Nadeau, 2007) - require large training data
◦ For Romanian – (Cucerzan and Yarowsky, 1999), (Ion, 2007) and (Machison, 2009) - NER gazetteer for Romanian included in GATE (Cunningham et al., 2002)
We consider the following categories: Person, Organization, Company, Region, Place, City, Country, Product, Brand, Model, and Publication
Named Entity Identification – based on segmentation, tokenizer and lemmatizer components
Named Entity Classification – based on lists, rules and trigger words
Every token that starts with a capital letter is then considered a named entity candidate
When a candidate is the first token in a sentence:
1. If it is in our stop word list, we eliminate it from the named entity candidates;
2. If it is in our common word list, we check two cases:
a) the common word is followed by lowercase words (we check them against a list of trigger words). Examples: Universitatea din Cluj-Napoca (En: University of Cluj-Napoca), Țara de Jos (En: Low Country)
b) the common word is followed by uppercase words. Example: Doctor Stomatolog (En: Doctor Dentist)
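The identification step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the three word lists are tiny hypothetical stand-ins for the system's resources, the function name find_candidates is invented, and the sketch assumes that a sentence-initial common word is kept as a candidate only when case a) or b) holds, which the slide implies but does not state.

```python
# Hypothetical, tiny resource lists; the real lists are not shown in the slides.
STOP_WORDS = {"în", "la", "și", "de", "din", "pentru"}
COMMON_WORDS = {"universitatea", "țara", "doctor", "ana"}
TRIGGER_WORDS = {"din", "de"}  # lowercase words allowed after a common word


def find_candidates(tokens):
    """Return the indices of tokens kept as named entity candidates in one sentence."""
    candidates = []
    for i, tok in enumerate(tokens):
        if not tok[:1].isupper():
            continue  # only capitalized tokens can be candidates
        if i == 0:
            low = tok.lower()
            if low in STOP_WORDS:
                continue  # rule 1: sentence-initial stop word is dropped
            if low in COMMON_WORDS:
                nxt = tokens[1] if len(tokens) > 1 else ""
                followed_by_trigger = nxt in TRIGGER_WORDS  # case a)
                followed_by_upper = nxt[:1].isupper()       # case b)
                if not (followed_by_trigger or followed_by_upper):
                    continue  # assumption: otherwise treat it as an ordinary word
        candidates.append(i)
    return candidates
```

For example, find_candidates(["Universitatea", "din", "Cluj-Napoca"]) keeps tokens 0 and 2, while a sentence starting with the stop word "În" loses its first token as a candidate.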
On the identified candidates we apply rules that unify adjacent candidates, in order to obtain compound named entity candidates:
1) Rules related to person titles – Doctorul Popescu (En: Doctor Popescu), Președintele Băsescu (En: President Băsescu)
2) Rules related to organization types – Universitatea Cuza (En: Cuza University)
3) Rules related to abbreviations – S.C. Travis
4) Rules related to special punctuation signs – Ana-Maria
5) Rules related to named entity candidates separated by stop words – BCR Banca pentru Locuințe (En: BCR Housing Bank), Direcția pentru Sănătate (En: Department of Health)
6) Rules for a specific model/product – Qosmio X500-Q930, Portege R835
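A minimal sketch of how such unification might work is given below. It covers only two simplified cases, directly adjacent candidates (as in rules 1 and 2) and candidates separated by a single stop word (rule 5); the stop word list and the function name unify are assumptions made for illustration, not the system's actual rules.

```python
STOP_WORDS = {"pentru", "de", "din"}  # hypothetical stop word list


def unify(tokens, candidate_idx):
    """Merge candidate tokens into spans and return them as strings.

    Adjacent candidates are joined, and so are candidates separated by
    exactly one stop word.
    """
    spans = []  # each span is [start, end], inclusive token indices
    for i in sorted(candidate_idx):
        if spans:
            gap = i - spans[-1][1]
            adjacent = gap == 1
            bridged = gap == 2 and tokens[i - 1].lower() in STOP_WORDS
            if adjacent or bridged:
                spans[-1][1] = i  # extend the current span
                continue
        spans.append([i, i])  # start a new span
    return [" ".join(tokens[s:e + 1]) for s, e in spans]
```

With tokens ["BCR", "Banca", "pentru", "Locuințe"] and candidates at positions 0, 1 and 3, the sketch yields the single compound candidate "BCR Banca pentru Locuințe".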
We consider the following major categories: City, Organization, Company, Country and Person; additionally, we consider categories such as Brand, Product and Publication
For almost all major categories we consider subcategories:
◦ For Cities we consider Romanian, European, American and Other Cities
◦ For Organizations we consider Parties, Faculties, Universities, Ministries, etc.
◦ For Persons we consider Sportsmen, Politicians, Males, Females, etc.
A total of 14 main categories with 98 sub-categories
Classification Rules:
1. Rules used in the unification of NE candidates
2. Pure resource-based rules – for the Title type
3. Contextual rules - we consider a mix between regular expressions and available entities from our files - for the Organization, Company, Person, City and Country types
a) For example oraș, capitală, târg, localitate (En: city, capital, town, locality) are triggers for the City type
b) companie, corporație (En: company, corporation) are triggers for the Company type
c) partid, bancă, universitate (En: party, bank, university) are triggers for the Organization type
d) All titles identified in the named entity identification step are triggers for the Person type
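The interplay between resource lists and trigger words can be sketched as below. The trigger words come from the slide; the gazetteer entries, the function name classify, the lookup order (resources first, then triggers) and the assumption that the preceding word is already lemmatized are all hypothetical choices made for illustration.

```python
# Trigger words are taken from the slide; the gazetteer entries are hypothetical.
TRIGGERS = {
    "oraș": "City", "capitală": "City", "târg": "City", "localitate": "City",
    "companie": "Company", "corporație": "Company",
    "partid": "Organization", "bancă": "Organization", "universitate": "Organization",
}
GAZETTEER = {"Cluj-Napoca": "City", "România": "Country"}


def classify(entity, previous_word=None):
    """Return a type for `entity`, or None when neither a resource nor a trigger applies."""
    if entity in GAZETTEER:
        return GAZETTEER[entity]  # resource-based rule
    if previous_word is not None:
        # contextual rule: the (lemmatized) preceding word may be a trigger
        return TRIGGERS.get(previous_word.lower())
    return None
```

Under these assumptions, classify("PNL", "partid") returns "Organization", while classify("PNL") returns None because the entity is in no list and has no trigger context.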
Gold: 48 files with 24,244 words and 1,638 NEs
Agreement between annotators – “PDL Cluj-Napoca” (organization) or “PDL” (organization) and “Cluj-Napoca” (city)?
When the first word of a sentence is a common word - Ana (a Romanian female name, or a rope used on boats)?
Special characters at the beginning of a row – segmentation problems
The percentage of matched and partially matched entities that have been properly categorized is 95.71%
The main problems in NE classification come from NEs that appear in more than one NE list
38 files with a total of 19,509 words and 1,215 NEs
Upper bound: 95.12% (P), 96.40% (R), 95.76% (F), 1.81% (PP), 1.83% (PR)
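As a quick consistency check on the reported figures, the F value matches the balanced F-measure, i.e. the harmonic mean of precision and recall. The slide does not state which F variant was used, so this is an assumption, and the function name f_measure is introduced only for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, in percent
print(round(f_measure(95.12, 96.40), 2))  # 95.76
```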
The problems from the upper bound evaluation remain the same
Additionally, new problems appear, related to the extraction of entities of type Title (which are written in lowercase)
The problems related to Title represent 3.70%
Error cases are represented by the following words: “călugăriță, soră, colonel, viceprimar, co-președinți” (En: nun, nurse, colonel, vice mayor, co-presidents)
The percentage of matched and partially matched entities that have been properly categorized is 66.73%
The error distribution over the named entity types
For the Company, Organization and Person types, the NEs were not found in our resources and the contextual rules could not be applied
The Publication and Product types are frequently marked interchangeably
For the Region type, the major cause of errors is that the respective NE also exists in the resources for other types, such as City, Place or Country
An interesting example is the case of PNL (which does not exist in our resources): when it is preceded by the word partid (En: party), it is correctly classified as Organization, but in all other cases the system does not identify any type for it
In this paper we present a system based on rules and on a list of resources, used in identification and classification of Romanian named entities
The system is able to distinguish between 14 main NE types
Future work will address the problems caused by common words at the beginning of sentences
Another future direction is related to anaphora, in order to transfer the type of a classified entity to all expressions that refer to it
The research presented in this paper was funded by the Sectoral Operational Program for Human Resources Development through the project “Development of the innovation capacity and increasing of the research impact through post-doctoral programs” POSDRU/89/1.5/S/49944
The authors of this paper thank their colleagues Alexandru Ginsca, Emanuela Boros, Augusto Perez and Dan Cristea from the Faculty of Computer Science, Iasi
David Nadeau and Satoshi Sekine, A survey of named entity recognition and classification, Linguisticae Investigationes 30 (2007), no. 1, 3-26, Publisher: John Benjamins Publishing Company.
Silviu Cucerzan and David Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, 1999, pp. 90-99.
H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, GATE: A framework and graphical development environment for robust NLP tools and applications, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
Radu Ion, Word sense disambiguation methods applied to English and Romanian, PhD Thesis, 2007.
Lucian Mihai Machison, Named entity recognition for Romanian (roner), Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques, KEPT2009, 2009, pp. 53-56.
V. Scurtu, E. Stepanov, and Y. Mehdad, Italian named entity recognizer participation in NER task @ EVALITA 09, 2009.
David Nadeau, Semi-supervised named entity recognition: Learning to recognize 100 entity types with little supervision, PhD Thesis, 2007.