Resolving Named Entities and Relations in Text for Applications in Literary Studies

Joana Sofia Dias Rocha

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Bruno Emanuel da Graça Martins
Prof. Pável Pereira Calado

Examination Committee

Chairperson: Prof. João Emílio Segurado Pavão Martins
Supervisor: Prof. Pável Pereira Calado

Member of the Committee: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur

May 2016


Acknowledgments

I would like to express my gratitude to my two supervisors, with special thanks to my main supervisor, Professor Bruno Martins. His skills, knowledge and availability were crucial for the completion of this dissertation. I thank him for all his hard work and for always pushing me to do better.
Thanks are also due to the many researchers whose countless papers I read while looking for relevant ideas and applicable results in the area. Their work was invaluable in building the knowledge and tools we have today, and it allowed me to continue it for a few months.
Final thanks go to my family and friends, whose emotional support helped me through the completion of this big milestone.


Abstract

Lately, there has been an increase in the number of texts available in digital libraries. There is also an increased interest in capturing the semantic relations expressed between entities in large amounts of text, with the aim of supporting more complex semantic tasks such as question answering. However, traditional information extraction techniques have a hard time following this sudden trend, as they rely heavily on manually annotated resources for training statistical models.
This research work had one main objective: to adapt and evaluate two relation extraction systems that follow a new information extraction paradigm, usually referred to as Open-Domain Information Extraction (OIE), in order to extract relations from Portuguese literary texts. This new information extraction technique is able to scale to massive amounts of text without requiring as much human involvement. The two systems in focus are named ReVerb and OLLIE.
Many tasks were addressed in order to obtain the results presented in this document, starting with the development of NLP models to process Portuguese texts, followed by the incorporation of these models into the OIE tools, and further changes to implementation details, resulting in two adapted systems that are able to process Portuguese texts.
This document therefore formalizes the approaches to the relation extraction problem, involving the new OIE paradigm on literary texts, and presents an extensive evaluation with four different literary books.

Keywords: Text Mining, Open-Domain Information Extraction, Relation Extraction, Natural Language Processing.


Resumo

Ultimamente, tem-se verificado um aumento do número de textos disponíveis em bibliotecas digitais. Há também um grande interesse em capturar relações semânticas, expressas entre entidades, de uma grande quantidade de textos, com o objectivo de realizar tarefas semânticas mais complexas, como por exemplo tarefas de pergunta e resposta. No entanto, as técnicas de extracção de informação tradicionais têm dificuldade em seguir esta tendência súbita, visto que estas técnicas dependem fortemente de recursos manualmente anotados que irão treinar modelos estatísticos.
Esta dissertação teve um objectivo principal: adaptar e avaliar dois sistemas de extracção de relações que seguem um novo paradigma de extracção de informação, geralmente referido como Extracção de Informação em Domínio Aberto, a fim de extrair as relações de textos pertencentes à literatura Portuguesa. Esta nova técnica de extracção de informação é capaz de se adaptar a uma grande quantidade de textos, sem necessitar de tanto envolvimento humano quanto as técnicas tradicionais. Os dois sistemas em foco são denominados por ReVerb e OLLIE.
Muitas tarefas foram abordadas a fim de obter os resultados apresentados neste documento, começando com o desenvolvimento de modelos de Processamento de Língua Natural para processar textos portugueses, seguido depois pela incorporação destes modelos nas duas ferramentas mencionadas, e finalmente foram realizadas modificações relacionadas com detalhes específicos de implementação, resultando assim em dois sistemas adaptados que são capazes de processar textos em Português.
Este documento, portanto, formaliza as abordagens em problemas de extracção de relações, envolvendo o novo paradigma de Extracção de Informação em Domínio Aberto sobre textos literários, apresentando uma extensa avaliação com quatro livros de diferentes autores.

Palavras-chave: Prospecção de Texto, Extracção de Informação em Domínio Aberto, Extracção de Relações, Processamento de Língua Natural.


Contents

Acknowledgments
Abstract
Resumo
List of Figures
List of Tables

1 Introduction
  1.1 Hypothesis and Methodology
    1.1.1 Methodology
    1.1.2 Datasets
  1.2 Contributions
  1.3 Outline

2 Fundamental Concepts
  2.1 Natural Language Processing
    2.1.1 Tokenization and Sentence Splitting
    2.1.2 Parts-of-Speech Tagging
    2.1.3 Named Entity Recognition and Classification
    2.1.4 Noun-Phrase Chunking
    2.1.5 Dependency Parsing
    2.1.6 Relation Extraction
  2.2 Machine Learning for NLP and IE Applications
    2.2.1 Hidden Markov Models
    2.2.2 Maximum Entropy Markov Models
    2.2.3 Conditional Random Fields
  2.3 Evaluation of NLP/IE Components
  2.4 Summary

3 Related Work
  3.1 Information Extraction in Literary Studies
    3.1.1 Social Network Extraction
    3.1.2 Identification of Speakers
    3.1.3 Extraction of Family Relations
  3.2 Open-Domain Information Extraction
    3.2.1 TextRunner


    3.2.2 WOE
    3.2.3 ReVerb
    3.2.4 Open Language Learning for Information Extraction
    3.2.5 SRL-IE
  3.3 Open-Domain Information Relation Extraction in Portuguese
  3.4 Summary

4 Resolving Named Entities and Relations in Literary Texts
  4.1 POS Tagging, Entity Recognition and Parsing Portuguese Texts
    4.1.1 Sequence Tagging Models
    4.1.2 POS Tagging
    4.1.3 Entity Recognition
    4.1.4 NP-Chunking
    4.1.5 Dependency Parsing
  4.2 Open-Domain Relationship Extraction in Literary Texts
    4.2.1 Adapting ReVerb
    4.2.2 Adapting OLLIE
  4.3 Summary

5 Experimental Validation
  5.1 Datasets and Methodology
  5.2 Experimental Results
    5.2.1 POS Tagging, Entity Recognition and Parsing Portuguese Texts
    5.2.2 Open-Domain Relationship Extraction in Literary Texts
  5.3 Summary

6 Conclusions and Future Work
  6.1 Main Contributions
  6.2 Future Work

Bibliography


List of Figures

2.1 Parse tree structures. (a) Constituency parse tree. (b) Dependency parse tree.
3.1 TextRunner's simplified architecture.
3.2 WOE's architecture.
3.3 ReVerb's syntactic constraint, i.e. part-of-speech-based regular expression.
3.4 OLLIE's simplified architecture.
4.1 Comparison between (a) ReVerb's original syntactic constraint and (b) ReVerb's syntactic constraint adapted to Portuguese.


List of Tables

2.1 Contingency table.
3.1 Performance on syntactic category prediction in the approach from Elson et al. (2010).
3.2 Three types of utterances showing the speaker in bold and character-mentions in italic.
3.3 Main features used in two different approaches when identifying speakers in novels.
4.1 Datasets and corresponding number of sentences and extractions supplied to the new dictionary.
5.1 Statistical characterization of all training datasets.
5.2 Tagger results from testing with the 4 datasets merged, using 5-fold cross-validation.
5.3 NER results obtained from testing with the CINTIL corpus using 5-fold cross-validation.
5.4 Constituent parsing results for each dataset, using 5-fold cross-validation.
5.5 Dependency parser results varying the datasets, using 5-fold cross-validation.
5.6 Statistical characterization of the relations found in 92 sentences of four different literary books.
5.7 ReVerb results on four different literary books.
5.8 ReVerb results on four different literary books, without the lexical constraint.
5.9 OLLIE results on four different literary books.


Chapter 1

Introduction

For the past few years, there has been an enormous increase in the amount of texts available in digital libraries. Avid readers and Humanities scholars, researching literary texts and trying to answer theoretical questions about their content, consequently see their productivity suffer, as they cannot keep up with this sudden trend. Recent developments have also led to the use of various techniques from the area of Information Extraction within distant reading approaches, for instance facilitating literary analysis and the automatic construction of knowledge bases containing rich information about the world's entities and their relations, as expressed in books. DBpedia1 is one example of a large-scale knowledge base, built by extracting structured information from Wikipedia pages, such as infoboxes and links to external web pages. There is also an increasing interest in capturing semantic relations from large amounts of text, with the aim of supporting more complex semantic tasks like question answering. Knowledge bases can be fundamental for some of these complex semantic tasks, and information extraction techniques are used in order to populate them.
This document addresses the subject of Information Extraction, namely by focusing on a novel extraction paradigm called Open Information Extraction. Traditional information extraction techniques rely on extensive human involvement in the form of hand-written rules or hand-tagged training examples. Open Information Extraction systems, instead, are able to scale to massive and heterogeneous texts without requiring as much human involvement as traditional information extraction techniques. Literary texts present these challenges, as they can be very large and of different genres. Even though knowledge bases built from Web resources are extremely important for conducting interesting tasks related to Information Extraction, this thesis focuses only on evaluating this new information extraction technique on book contents, particularly for the Portuguese language.
The following sections describe the hypothesis and the methodology used, as well as the main contributions that were achieved with the development of this thesis.

1.1 Hypothesis and Methodology

My MSc thesis concerns the general problem of Information Extraction over works of fiction, envisioning applications in literary studies and the extraction of semantic knowledge about fictional entities from large documents. The language of the literary texts in focus is Portuguese, unlike the majority of previous work developed in the field, which has focused mainly on English texts.

1 http://dbpedia.org


My research work tried to prove, through controlled experiments, the following hypothesis:

  • Open Information Extraction techniques can be effective in extracting information, namely valid relations between characters and locations, in Portuguese literary texts.

The methodology and datasets used in my work are described in the following sections.

1.1.1 Methodology

In order to evaluate the hypothesis, two state-of-the-art OIE systems were modified. These systems currently offer the best trade-off between accuracy and performance:

1. I adapted ReVerb2 by modifying the implementation details related to its most important extraction rules, the syntactic and semantic rules. New NLP models were also trained and used, as these were required to perform basic NLP operations over Portuguese texts. Such models concern the tasks of POS tagging and NP-chunking;

2. I adapted OLLIE3 by supplying its bootstrapper with seeds produced by the now modified ReVerb. It was also necessary to change the tools that perform the basic NLP operations, namely the POS tagger and the dependency parser.

I have used the Stanford CoreNLP4 framework to provide the components for the training and use of the aforementioned NLP models. An NER model was also developed, using the same framework, in order to extract named entities from literary texts.

1.1.2 Datasets

The following list of datasets was used to create NLP models in the context of this thesis. These datasets correspond to previously annotated corpora of Portuguese texts, supporting the training of NLP models:

  • CINTIL corpus5 of modern Portuguese: composed of 1 million annotated tokens (POS tags and named entities using the BIO encoding), in which 33.96% of the content came from newspapers, 16.80% from literary texts and 42.18% from transcribed informal conversations;

  • Universal Dependencies6: contains treebanks for over 40 different languages, with cross-linguistically consistent dependency annotations developed by adapting and harmonizing variants of the Stanford typed dependencies [45]. It also contains 12 broad POS tag categories7 that exist across over 25 language-specific tagsets;

  • Floresta Sintática8: contains Bosque, which is composed of 9,368 human-annotated sentences. Bosque contains two sets of annotated sentences, with morphological (POS tags) and syntactic (constituent parsing) annotations;

2 http://reverb.cs.washington.edu
3 https://github.com/knowitall/ollie
4 http://stanfordnlp.github.io/CoreNLP/
5 http://cintil.ul.pt/pt/cintilwhatsin.html
6 http://universaldependencies.org
7 https://github.com/slavpetrov/universal-pos-tags
8 http://www.linguateca.pt/Floresta/


  • Tycho Brahe Parsed Corpus of Historical Portuguese9: composed of 66 literary texts, ranging from the year 1380 to 1881 and making 2,842,809 words available [29]. It also contains two sets of annotated sentences, one with morphological and the other with syntactic annotations.

Lastly, I selected the following public domain books10, as they have different writing styles and are narrated in the third person. This creates greater diversity, thus allowing a more extensive evaluation:

  • Os Maias, written by Eça de Queirós;

  • Orgulho e Preconceito, written by Jane Austen, translated by Lúcio Cardoso;

  • Amor de Perdição, written by Camilo Castelo Branco;

  • Alice no País das Maravilhas, written by Lewis Carroll, translated by Isabel De Lorenzo.

1.2 Contributions

The following contributions are the result of this MSc thesis:

  • I created and evaluated POS tagging, NER and dependency parsing models for the Portuguese language, using Stanford's software framework for training and testing these models. The datasets involved in training each model, mentioned in Section 1.1.2, went through a normalization process following the Universal Part-of-Speech Tagset and the Universal Dependencies guidelines. More details on the normalization process are presented in Chapter 4, while details about the evaluation are given in Chapter 5. The models were tested using 5-fold cross-validation, obtaining F1-measures of 90.05% and 90.37% for the POS tagger and the NER model, respectively. The dependency parsing model reached 79.98% UAS and 75.91% LAS;

  • I adapted ReVerb, a tool that performs relation extraction following the OIE paradigm, to extract relation phrases and their arguments from Portuguese texts. The main differences from the original tool, besides implementation details related to English-specific and Penn Treebank tag-specific code, reside in the creation of a new ReVerb dictionary and the re-training of the confidence function. ReVerb's dictionary was built using the datasets in Section 1.1.2 plus all of Wikipedia's content pages in Portuguese and a set of news articles from Público. The confidence function was trained by manually annotating the extractions obtained from around 200 sentences taken from Wikipedia articles. ReVerb reached an F1-measure between 32.26% and 38.71% when evaluated on a set of sentences from four different books;

  • I adapted OLLIE, a relation extraction tool following the OIE paradigm that aims to overcome ReVerb's limitations. OLLIE was modified in order to extract relation phrases and their arguments from Portuguese texts. Besides language-specific code related to English, Penn Treebank tags and Stanford Dependencies, modifications happened mostly in the building and reading of the Open Pattern Templates, which are essential for extracting relation tuples. A bootstrapping model was built using high-confidence tuples obtained from the adapted ReVerb by processing all available datasets (those mentioned in Section 1.1.2, plus Portuguese Wikipedia articles and a set of news texts from Público).

9 http://www.tycho.iel.unicamp.br/~tycho/corpus/en/
10 http://www.livros-digitais.com/


    The confidence function was trained by manually annotating the obtained extractions, again from around 200 sentences taken from Wikipedia articles. OLLIE reached an F1-measure between 4.26% and 26.09% when evaluated on a set of sentences from four different books.

1.3 Outline

The remainder of this document is organized as follows:

  • Chapter 2 presents the basic concepts in the context of this MSc thesis, which were used throughout its development. It goes through basic concepts within the field of Natural Language Processing, followed by popular machine learning algorithms used in sequence classification tasks and their application to information extraction, finishing with the standard ways of evaluating NLP models;

  • Chapter 3 presents previous publications reporting techniques used to extract information from literary texts. It also details various IE systems which use the new Open Information Extraction paradigm to extract relation phrases and their arguments from Web texts. This chapter concludes with related work focusing on the use of Information Extraction techniques on Portuguese texts, as the other previous studies presented have been applied only to English texts;

  • Chapter 4 details the development processes which originated the aforementioned main contributions. It thoroughly describes each step and the tools used in order to obtain each necessary product. The chapter starts by describing all the steps carried out to obtain each NLP model, and then proceeds to explain the incorporation and implementation methods used to develop the two adapted OIE systems;

  • Chapter 5 presents the results obtained from several experiments involving the developed NLP models and the two adapted OIE systems, concluding with a critical analysis of the subject in focus;

  • Chapter 6 concludes this document by reviewing the main conclusions of the analysis done over the results, and discusses some possible ideas for future work.


Chapter 2

Fundamental Concepts

This chapter introduces fundamental concepts used in the context of this MSc thesis. Section 2.1 describes the tasks of a typical natural language processing and/or information extraction pipeline, while Section 2.2 presents common sequence classification models used in some of the natural language processing tasks that are described. Finally, Section 2.3 presents the metrics that are typically used to evaluate the performance of these systems.

2.1 Natural Language Processing

Natural Language Processing (NLP) is an interdisciplinary field concerned with tasks that involve the use of a human language [38]. Human languages are more ambiguous and therefore more complex than most other languages (e.g. computer languages), giving rise to some problems in their automated processing. The following sections describe the NLP tasks that, in an orderly manner, form the necessary steps to extract information from text.

2.1.1 Tokenization and Sentence Splitting

Given a textual input, the first NLP task is usually concerned with segmenting the text into its constituent units, i.e. sentences and/or words. These two challenges can equally be seen as a problem of defining boundaries. To illustrate this task, consider the case of sentence segmentation. In sentence segmentation, the boundary problem arises with the period or full-stop character. Unlike the exclamation point or the question mark, which are both commonly used at the end of a sentence, the period symbol can be ambiguous in most sentences. Simple heuristics can be used in order to break sentences, such as breaking lines at every punctuation mark, but these do not fully tackle the ambiguity problem. Other, more sophisticated approaches resort to rule-based and machine learning techniques to overcome this ambiguity.
The tokenization task is instead concerned with the segmentation of words, also called tokens. The most intuitive approach to accomplish this is to split the sentence by spaces, leaving single words as tokens. However, this is not a simple task. To illustrate the issue, take for example the names of people and organizations. Often one word is not enough to capture them entirely, and a name consisting of various words will be wrongly separated into various tokens. This issue will be discussed in more detail in Section 2.1.3. A small sketch of these simple heuristics is given below.
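The following Python sketch illustrates the simple heuristics just discussed: a whitespace-based sentence splitter that treats a period as a boundary unless the token is in a small abbreviation list, and a naive tokenizer that detaches punctuation from words. The abbreviation list and example sentences are illustrative only, not taken from any of the tools used in this thesis.

import re

# Illustrative abbreviation list; a real splitter would need a far larger one.
ABBREVIATIONS = {"Sr.", "Dr.", "Prof.", "etc."}

def naive_sentence_split(text):
    """Break on '.', '!' or '?' at the end of a token, unless the token is a
    known abbreviation -- the period ambiguity discussed above."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def naive_tokenize(sentence):
    """Split on whitespace and detach punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(naive_sentence_split("O Prof. Martins chegou. A aula começou."))
# ['O Prof. Martins chegou.', 'A aula começou.']
print(naive_tokenize("O rato comeu o queijo."))
# ['O', 'rato', 'comeu', 'o', 'queijo', '.']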


State-of-the-art tokenizers rely on probabilistic approaches, as these learn and use a probabilistic model. Some popular examples of probabilistic approaches are referred to in Section 2.2. Languages that do not have spaces separating words from each other, e.g. Chinese, are usually the ones that benefit the most from the probabilistic approach. In this situation, machine learning algorithms and hand-segmented training sets have been reported to be the most successful approach when dealing with the tokenization problem for this kind of language.

2.1.2 Parts-of-Speech Tagging

Part-of-speech (POS) tagging is the process of labeling each word of a given sentence with its corresponding lexical category [38]. POS taggers label each word according to a limited set of lexical categories. The Penn Treebank, Brown and C5 tagsets are the most popular tagsets for the English language. Each tagset varies in the number of tags: 45 tags in the Penn Treebank, 61 in the case of C5 and finally 87 in the Brown tagset. Tags not only refer to syntactic categories, but often also to the morphological attributes of a word, i.e. gender and number. These kinds of tags are usually used in morphologically rich languages such as German and French.
Often, instances of a word can have more than one possible tag. In this situation we are faced with syntactic ambiguity. POS tagging aims to eliminate this ambiguity by selecting the most appropriate tag according to the context. Resolving the correct part-of-speech tag can be hard, even for linguists. Some techniques that attempt to solve this issue are briefly mentioned next:

Rule-based Tagging: One can use an algorithm that leverages manually coded constraints in order to eliminate possible combinations of part-of-speech tags in sentences and, eventually, find the most suitable tagging for the sentence [33, 31]. This type of algorithm usually consists of two stages. The first stage uses a dictionary to assign each word a list of potential part-of-speech tags. The second stage uses the manually coded constraints — also called hand-written disambiguation rules — to reduce the possible number of combinations, hopefully resulting in only one part-of-speech tag for each word. If this is not accomplished, then it is necessary to insert more disambiguation rules. Modern versions of this algorithm have a similar architecture, but with larger dictionaries and larger sets of rules. A minimal sketch of this two-stage idea is given at the end of this subsection.

HMM-based Tagging: One can also use the probabilistic method of Hidden Markov Models in order to calculate the most likely sequence of part-of-speech tags corresponding to a given sentence [5]. This requires a manually annotated corpus containing a set of sentences with the correct part-of-speech tag attached to each word. After building the Hidden Markov Model from the provided corpus, the Viterbi algorithm can be used to compute the most likely hidden sequence of tags for a sentence. This method can also be used to identify named entities in a text, and it will be detailed in Section 2.2.1. State-of-the-art HMM POS taggers [9] usually depend on the previous two tags. These POS taggers are called trigram models because they take into account three positions, namely the current word and the previous two words and their tags.

The probabilistic approach is often selected when dealing with the POS tagging task. Still, other machine learning algorithms have been used besides Hidden Markov Models; some examples are Maximum Entropy classifiers, Support Vector Machines and Neural Networks.
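The Python sketch below illustrates the two-stage rule-based tagging idea described above, under the assumption of a toy dictionary and a single hand-written disambiguation rule; real systems use far larger dictionaries and rule sets.

# Stage 1: a toy dictionary mapping each word to its possible tags.
LEXICON = {
    "the": ["DET"],
    "mouse": ["NOUN"],
    "ate": ["VERB"],
    "cheese": ["NOUN"],
    "book": ["NOUN", "VERB"],   # an ambiguous entry
}

# Stage 2: hand-written disambiguation rules that prune the candidate tags.
def apply_rules(candidates, prev_tag):
    # Illustrative constraint: discard a verb reading right after a determiner.
    if prev_tag == "DET" and "VERB" in candidates and len(candidates) > 1:
        candidates = [t for t in candidates if t != "VERB"]
    return candidates

def rule_based_tag(tokens):
    tags, prev = [], "<start>"
    for word in tokens:
        candidates = list(LEXICON.get(word.lower(), ["NOUN"]))  # default guess
        candidates = apply_rules(candidates, prev)
        tags.append(candidates[0])  # ideally only one tag remains
        prev = tags[-1]
    return list(zip(tokens, tags))

print(rule_based_tag(["The", "book"]))   # [('The', 'DET'), ('book', 'NOUN')]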


2.1.3 Named Entity Recognition and Classification

Named-Entity Recognition (NER) is one of the most important NLP tasks within the context of Information Extraction. Its goal is to find named entities in a text and classify them into semantic categories such as person, organization or location. The standard way to approach this task is by building a model describing the classes of entities that are expected in a document and where they usually appear. Section 2.2 describes in more detail some of the models typically used for this task. These models are usually statistical, relying on supervised machine learning algorithms. A corpus is required in order to build these models, composed of sentences with each word correctly labeled with its entity class. An example of an annotated sentence is:

The flight from [New York]LOC has just landed .

In this sentence, New York was recognized as an entity and classified as a location. The annotated corpus supplies the model with gold answers, from which additional data is extracted to help the model detect these entities. This information is usually called a feature vector and is composed of linguistic attributes that describe the entity and its surrounding context. These attributes vary according to each type of entity. Features can describe the morphology of a word, for example whether it is capitalized or not. Some can describe the length and even how the word is formed, e.g. a postal code such as 2900-123 can be transformed into dddd-ddd (with d meaning digit). Other features are related to the surroundings of the word, being associated with the sequences that come before and/or after it. For a better illustration, imagine that Mexico City is a location yet to be classified. Suppose that, during a classification task, the word Mexico was already identified as a location name. When trying to classify the next word, which is City, both types of features mentioned previously apply. Knowing that the current word is capitalized and that the previous word was already classified as a location is clearly strong evidence that the current word is still part of the previous entity, a location, and therefore should be classified as such. Another example of an annotated sentence is:

Harry  Potter  ,  the  boy  who  lived  .
B-PER  I-PER   O  O    O    O    O      O

This last sentence shows an example of a tag encoding style, which helps separate the boundaries of an entity. The BIO encoding is the one exemplified above, although this is just one of many different encoding schemes. BIO stands for Beginning, Inside and Other. These tags, together with the type of entity, help us to know the extent of each entity. Another tag encoding style is the SBIEO scheme, where the letters stand for Single, Beginning, Inside, Ending and Other. Features are then extracted from the training corpus containing the annotated sentences.
NER is usually seen as a word-by-word sequence labeling task, and the feature concept just introduced supports it. Knowing that the entities to be detected in this work are usually composed of proper nouns, it can be inferred that proper nouns close to each other belong to the same entity. This inference is possible due to the words' features: the fact that words are capitalized and that their POS tag is a proper noun is a very good indicator that we are dealing with an entity. A small sketch of such features is given below.
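As an illustration of the features discussed above, the following Python sketch builds a simple feature vector for a token, combining its word shape, capitalization and surrounding context with the previously assigned label. The specific feature names are illustrative, not taken from any particular NER tool.

def word_shape(word):
    """Map characters to a coarse shape, e.g. '2900-123' -> 'dddd-ddd'."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isalpha():
            shape.append("X" if ch.isupper() else "x")
        else:
            shape.append(ch)
    return "".join(shape)

def ner_features(tokens, i, prev_label):
    """Feature vector for token i, in the spirit of the cues described above."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),
        "word.shape": word_shape(word),
        "prev.label": prev_label,                                 # e.g. B-LOC
        "prev.word": tokens[i - 1].lower() if i > 0 else "<start>",
        "next.word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
    }

tokens = ["The", "flight", "from", "New", "York", "has", "landed", "."]
print(ner_features(tokens, 4, prev_label="B-LOC"))   # features for 'York'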


2.1.4 Noun-Phrase Chunking

Chunking is a partial parsing method, used when one does not need a complete parse tree. Although parse trees are an important representation to aid semantic analysis, sometimes only a superficial syntactic analysis is truly needed. Non-overlapping segments, called chunks, correspond to the major parts of speech found in each grammar rule. The following example presents a sentence and its chunks:

[NP The bus] [PP from] [NP New York] [VP has arrived] .

Noun-phrase chunking, in particular, consists of finding the noun phrases — phrases with a main noun and some modifiers — in a text, treating everything else indifferently. The two main techniques for discovering chunks in sentences are as follows:

Finite-State Approaches: These methods use finite-state transducers to encode grammar rules [1]. These transducers are then either joined together or cascaded in order to detect chunks. In the first case, by merging all of the transducers through their union, the final transducer receives a sentence annotated with POS tags and outputs the chunks of that sentence. The second case consists in detecting bigger and bigger chunks as the sentence is passed through the different transducers. Chunking proceeds left-to-right, finding the longest matching chunk from the beginning until the end of the sentence. Consequently, this is a greedy process, which does not guarantee that the best chunks for a particular sentence will be found.

Machine Learning-Based Approaches: Data can be annotated with the help of BIO tagging, where B indicates the beginning of an NP chunk, I indicates the internal tokens of a chunk, and O means other — in this case, the other tag is used for everything outside NP chunks. After training the model, and similarly to the case of POS tagging or named entity recognition, features are extracted from each word being classified, and the chunk tagger assigns the most appropriate tags to the words in a sentence, from which the chunks can be recovered (a sketch of this recovery step follows the example below). The output expected for the previous example is as follows:

The   bus   from  New   York  has  arrived  .
B-NP  I-NP  O     B-NP  I-NP  O    O        O
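The following Python sketch shows the recovery step mentioned above: reading the NP chunks back from a BIO-tagged sequence. It assumes the simple B-NP/I-NP/O tagset used in the example.

def bio_to_chunks(tokens, tags):
    """Collect the noun phrases marked with B-NP/I-NP tags, as in the example above."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NP":
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:
            current.append(token)
        else:                      # 'O' (or a stray I-NP) closes any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["The", "bus", "from", "New", "York", "has", "arrived", "."]
tags   = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O", "O", "O"]
print(bio_to_chunks(tokens, tags))   # ['The bus', 'New York']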

2.1.5 Dependency Parsing

There are two ways to describe sentence structure in natural language. The first option is to break the sentence into constituents (e.g., phrases), which can then be broken into smaller constituents, and the second option is to connect individual words. These approaches are called constituency grammar and dependency grammar, respectively. An example of a dependency grammar parse tree is shown in Figure 2.1. The observation that drives dependency grammars is a simple one: in a sentence, all but one word depend on other words. The verb is usually the root of the sentence, not depending on any other word. A word only depends on another if it is a complement or a modifier of the latter.

Each link in the dependency parse tree connects two lexical nodes (e.g., words) and is drawn from a fixed inventory of labels, which represent grammatical functions, both syntactic and semantic.


Figure 2.1: Parse tree structures. (a) Constituency parse tree. (b) Dependency parse tree.

Each link is called a dependency, and is an asymmetrical binary relation between two words. Connections unite a superior and an inferior term. The superior term is usually called the head of the dependency (i.e., governor, superior), while the inferior term is the dependent (i.e., subordinate, inferior). In the example shown in Figure 2.1, sing is the head of the dependency and bird is the dependent.
Dependency parsing algorithms include dynamic programming [23], the maximum spanning tree approach [44] and deterministic parsing [51].

2.1.6 Relation Extraction

A relation is defined as an association between or among things. The association is usually expressed as an action, through a verb, between two entities, e.g. Sarah lives in Lisbon. Other associations can be extracted implicitly from sentences, such as Richard is Mary's firstborn child. In the latter situation, a parent/child relationship is stated by the words is firstborn, although this phrase does not refer to an action. This is also a valid association, where the entities are restricted to people rather than things.
Relations are usually described as binary relations, in which there are two arguments and a relation defined between them. However, there are instances where only n-ary (n argument) relationships can be extracted from a sentence. In order to extract semantic relations like the ones mentioned above, the following general approaches can be used:

Rule-Based Approach: Uses manually crafted rules to extract relations from a text. These patterns can also be inferred by several other rule-based systems using lexical, syntactic and/or semantic information from a text [49, 36].

Feature-Based Supervised Approach: This approach needs an annotated corpus with relations marked as positive examples. It can be divided into two subtasks: relation detection and subsequent classification. A relation is detected whenever the system classifies it as a positive example. The next step requires feature vectors in order to classify the type of the relation. Feature vectors are created with linguistic information from the sentence to be classified. This linguistic information focuses on lexical, syntactic and semantic features of the sentence. For example, while lexical features focus on the words that surround the entities, syntactic features rely on the structure of the sentence, particularly on aspects such as the dependency path between the two entities. Finally, semantic features concern the types of entities found in the sentence. These feature vectors are then given to a classifier in order to label the relation in which two or more entities interact. Support Vector Machines (SVMs) are one of the various supervised learning techniques that can be used [16].

Kernel-Based Supervised Approach: Feature-based methods create individual feature vectors for each training example. In some cases, feature vectors cannot be represented reasonably due to limited


space, for instance when they require syntactic information (e.g., a parse tree). This method, therefore, leverages input representations over a higher-dimensional space. Kernel functions are calculated between two examples x and y, where the kernel K(x, y) quantifies the similarity between x and y. The higher the value of K(x, y), the more similar x and y are [47, 60, 17].

Semi-Supervised Approach: Also called bootstrapping, this approach requires a smaller number of training instances. Patterns are inferred from this small set of examples and used to classify unlabeled instances. These newly labeled instances are then used to infer more relation patterns, and the procedure continues iteratively [10, 2, 52]. The problem of semantic drift arises with this approach, in which continuously making inferences over inferences leads to extracted relations that stray from the initially intended ones. Some techniques that deal with this issue have nonetheless been developed [18, 46].

Distantly Supervised Approach: This approach requires a large knowledge base with various relationships between entities. Whenever two entities are mentioned together in a sentence, there is a high likelihood that the relation in the knowledge base containing these two entities expresses the same relationship. Even with the same entities, there are situations where the sentence does not reflect the relationship stored in the knowledge base. However, the number of noisy sentences is expected to be much smaller than the number of correctly inferred ones [50, 37].

Open-Domain Information Extraction Approach: This approach is better suited when the types of relations to be extracted are not known a priori. It can handle large and heterogeneous texts, like Web documents, and it can resort to two techniques: shallow features and dependency parsing. Shallow features refer to properties of a word, such as its POS tag, while dependency parsing captures the relations between words and the structure of the sentence. The result of the latter process is a tree where grammatical dependencies between constituents of the sentence are expressed. Examples of some of these systems are described in more detail in Section 3.2, and a rough sketch of the shallow-feature flavour is given below.
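As a rough illustration of the shallow-feature technique mentioned in the last item, the Python sketch below matches a verb-centred pattern over POS tags, in the spirit of (but much simpler than) ReVerb's syntactic constraint discussed in Chapter 3. The pattern, the toy tag-to-token mapping and the tagged sentence are illustrative assumptions only.

import re

def open_ie_extract(tagged):
    """Very rough Open IE sketch over (token, POS) pairs: find a NOUN VERB (ADP) NOUN
    window in the tag sequence and emit an (arg1, relation, arg2) triple."""
    tags = " ".join(tag for _, tag in tagged)
    # Relation phrase: a verb optionally followed by an adposition, e.g. 'vive em'.
    pattern = re.compile(r"\bNOUN (VERB( ADP)?) NOUN\b")
    m = pattern.search(tags)
    if not m:
        return None
    # Map the character position of the match back to a token index.
    start = len(tags[: m.start()].split())
    rel_len = len(m.group(1).split())
    arg1 = tagged[start][0]
    rel = " ".join(tok for tok, _ in tagged[start + 1 : start + 1 + rel_len])
    arg2 = tagged[start + 1 + rel_len][0]
    return (arg1, rel, arg2)

sentence = [("Sarah", "NOUN"), ("vive", "VERB"), ("em", "ADP"), ("Lisboa", "NOUN")]
print(open_ie_extract(sentence))   # ('Sarah', 'vive em', 'Lisboa')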

2.2 Machine Learning for NLP and IE Applications

To implement Natural Language Processing (NLP) tasks, machine learning algorithms are nowadays typically used. Sentences are typically modeled as sequences of tokens, and sequence classifiers are often employed. A sequence classifier is a model that receives, as input, a sequence of single units, for example a sentence or a string, and outputs the sequence of labels that best classifies each unit. Some sets of labels used in NLP applications were already described in the previous sections; part-of-speech tags, named entity classes and chunks are some examples of labels used in sequence classifiers. Section 2.2.1 describes a generative probabilistic model that assigns the most likely tag based on the previous assignments. Sections 2.2.2 and 2.2.3 present two discriminative models that extract features for each word and assign a class based on these features.

2.2.1 Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical model useful for decoding the labels of a sentence. In other words, HMMs are used to calculate and assign the hidden events (e.g. tags) from observable events (i.e. the words of a sentence). The outcome of this model is influenced by the following three factors:

1. The probability of a given label being the first;


2. The probability of a specific label being followed by another;

3. The probability of a specific label resulting from each word;

Each factor described above corresponds to one of the probability matrices defined in the formal definition of an HMM: π, A and B, respectively. An HMM λ is thus formally defined as λ = (A, B, π), in which A = a11, a12, ..., aNN is the transition probability matrix, where aij is the probability of being in state j at time t given that we were in state i at time t-1; B = bi(ot) contains the emission probabilities, which indicate the likelihood of an observation symbol ot being generated from state i; and finally π = π1, π2, ..., πN is the initial probability distribution, which indicates the probability that the Markov chain will start in state i. In order to build these matrices, a training corpus needs to be supplied, with annotated labels on each unit to be classified. To better exemplify an assignment using an HMM, this simple sentence will be used:

The mouse ate the cheese.

Given this sentence, any of the classification tasks that were already described (POS tagging in Section 2.1.2, NER in Section 2.1.3 and chunking in Section 2.1.4) can be accomplished using this model. Let us imagine a POS tag assignment task. Equations 2.1, 2.2 and 2.3 translate into entries of the already mentioned parameters π, A and B of an HMM. These equations count the frequencies of the situations in which the words from the input sentence appear, resulting in a probability stating the confidence with which they might appear in that situation again.

\pi_{Det} = P(Det \mid <start>) = \frac{Count(<start>\ Det)}{Count(<start>)} \quad (2.1)

a_{Det,Noun} = P(Noun \mid Det) = \frac{Count(Det\ Noun)}{Count(Det)} \quad (2.2)

b_{Det}(The) = P(The \mid Det) = \frac{Count(The = Det)}{Count(Det)} \quad (2.3)
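A minimal Python sketch of these counts is shown below, estimating π, A and B by relative frequency as in Equations 2.1-2.3; the one-sentence corpus is illustrative only.

from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of pi, A and B (Equations 2.1-2.3)."""
    pi, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                pi[tag] += 1                   # Equation 2.1: first-tag counts
            else:
                trans[prev][tag] += 1          # Equation 2.2: tag bigram counts
            emit[tag][word.lower()] += 1       # Equation 2.3: word given tag
            prev = tag
    n_sent = sum(pi.values())
    pi = {t: c / n_sent for t, c in pi.items()}
    A = {t: {u: c / sum(row.values()) for u, c in row.items()} for t, row in trans.items()}
    B = {t: {w: c / sum(row.values()) for w, c in row.items()} for t, row in emit.items()}
    return pi, A, B

corpus = [[("The", "DET"), ("mouse", "NOUN"), ("ate", "VERB"),
           ("the", "DET"), ("cheese", "NOUN")]]
pi, A, B = estimate_hmm(corpus)
print(pi["DET"], A["DET"]["NOUN"], B["DET"]["the"])   # 1.0 1.0 1.0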

After the HMM is defined, i.e. after all matrices are filled with the calculations presented in the previous equations, the Viterbi algorithm [38] is usually employed to find the most likely hidden sequence given an observed sequence and the model. The Viterbi algorithm calculates probabilities for each possible generation path by using the probability matrices and by taking the maximum value of the previous path calculations as it moves forward to the assignment of the next word. When it reaches the end of the sentence, the Viterbi algorithm backtraces to the starting state by following the path that passes through the states with the highest probability. This sequence of states is the resulting sequence of tags T that assigns the best labels to a given sentence, as stated in Equation 2.4. Each state that belongs to the most likely path corresponds to a label to be assigned to the corresponding word of the input sentence.

T = \arg\max_{T} \prod_{i} P(word_i \mid tag_i) \prod_{i} P(tag_i \mid tag_{i-1}) \quad (2.4)
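The following Python sketch is a minimal Viterbi decoder for Equation 2.4, using dictionaries for π, A and B; the toy parameter values and the small floor probability for unseen events are illustrative assumptions, not part of the formal model.

def viterbi(tokens, pi, A, B, tags, floor=1e-6):
    """Most likely tag sequence under the HMM of Equation 2.4, via dynamic programming."""
    V = [{t: pi.get(t, floor) * B.get(t, {}).get(tokens[0].lower(), floor) for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for t in tags:
            emit = B.get(t, {}).get(tokens[i].lower(), floor)
            prev, score = max(((p, V[i - 1][p] * A.get(p, {}).get(t, floor) * emit)
                               for p in tags), key=lambda x: x[1])
            V[i][t], back[i][t] = score, prev
    last = max(V[-1], key=V[-1].get)          # best final state ...
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):   # ... then follow the back-pointers
        path.append(back[i][path[-1]])
    return list(zip(tokens, reversed(path)))

# Toy parameters in the spirit of Equations 2.1-2.3 (illustrative values only).
pi = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
A  = {"DET": {"NOUN": 0.9, "VERB": 0.1},
      "NOUN": {"VERB": 0.6, "NOUN": 0.2, "DET": 0.2},
      "VERB": {"DET": 0.7, "NOUN": 0.3}}
B  = {"DET": {"the": 1.0},
      "NOUN": {"mouse": 0.5, "cheese": 0.5},
      "VERB": {"ate": 1.0}}
print(viterbi(["The", "mouse", "ate", "the", "cheese"], pi, A, B,
              tags=["DET", "NOUN", "VERB"]))
# [('The', 'DET'), ('mouse', 'NOUN'), ('ate', 'VERB'), ('the', 'DET'), ('cheese', 'NOUN')]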

Even though Hidden Markov Models can perform sequence assignment, adding additional information to them is extremely hard. This additional information (e.g., word capitalization) can be very helpful when addressing, for example, NER tasks. Detecting a person's name is easier when detecting the


words that are capitalized. In this situation, checking whether a word is capitalized can be seen as a feature, and HMMs are not able to directly handle this kind of additional information while calculating the most probable hidden sequence.

2.2.2 Maximum Entropy Markov Models

Unlike the HMM, which is a generative model, Maximum Entropy Markov Models (MEMMs) [38] are discriminative models. The main difference between them is that the HMM's starting point is the hidden events (e.g., labels); a generative model needs to enumerate all the possible observation sequences from the hidden events. MEMMs take the observation data as given and calculate the probability of the possible label sequences according to the observable events. They accomplish this by taking additional information from the input sequence, called features.

T = \arg\max_{T} \prod_{i} P(tag_i \mid word_i, tag_{i-1}) \quad (2.5)

Equation 2.5 demonstrates that MEMMs compute a single probability function at each state, conditioned on the previous hidden state and the current observation, to estimate the most likely hidden sequence T. For each class there is a set of features, and each feature fi is associated with a weight wi. The weight determines how much a feature defines a particular class. Equation 2.5 uses the expression in Equation 2.6, which shows the estimation of the probability of a transition from state qi to a state qj producing an observation o. Equation 2.7 shows the normalization factor, which makes the probabilities correctly sum to 1. It corresponds to a sum over all the possible classes C, taking into account the current word and its features.

P(q_j \mid q_i, o) = \frac{1}{z(o, q_i)} \exp\left(\sum_{i} w_i f_i(o, q_j)\right) \quad (2.6)

z(o, q_i) = \sum_{C} \exp\left(\sum_{i} w_i f_i(o, q_j)\right) \quad (2.7)

To better illustrate this model, let us again use the example sentence from Section 2.2.1 for the POS tagging task. Imagine that we are on a path where The was assigned as a determiner and mouse is currently being estimated.

The   mouse  ate  the  cheese
DET   ?

A good feature that can be considered to aid in the correct tag assignment is shown in Equation 2.8. Features are usually represented as binary functions, and they indicate, for each class c, properties under which the specified class should occur. Equation 2.8 indicates that the class NOUN is expected to come after the class DET, which is exactly what is happening in the example.

f_1(c, x) = \begin{cases} 1 & \text{if } t_{i-1} = DET \text{ and } c = NOUN \\ 0 & \text{otherwise} \end{cases} \quad (2.8)
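The Python sketch below ties Equations 2.6-2.8 together for this example: it defines the binary feature of Equation 2.8 (plus a second, made-up feature), assigns illustrative weights (which GIS would normally estimate), and computes the normalized probability of tagging mouse as NOUN after DET.

import math

# Feature functions in the style of Equation 2.8 (binary indicators per class c).
def f1(c, prev_tag, word):
    return 1 if prev_tag == "DET" and c == "NOUN" else 0

def f2(c, prev_tag, word):            # illustrative extra feature: suffix cue for verbs
    return 1 if word.endswith("ate") and c == "VERB" else 0

FEATURES = [f1, f2]
WEIGHTS = [2.0, 1.5]                   # made-up weights; training would estimate these

def memm_probability(c, prev_tag, word, classes):
    """P(c | prev_tag, word) as in Equations 2.6 and 2.7: exponential of the
    weighted feature sum, normalized over all candidate classes."""
    def score(cls):
        return math.exp(sum(w * f(cls, prev_tag, word) for w, f in zip(WEIGHTS, FEATURES)))
    return score(c) / sum(score(cls) for cls in classes)

classes = ["DET", "NOUN", "VERB"]
print(round(memm_probability("NOUN", "DET", "mouse", classes), 3))   # ~0.787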


                             Real
                   Class A                Class B
System   Class A   True Positive (tp)     False Positive (fp)
         Class B   False Negative (fn)    True Negative (tn)

Table 2.1: Contingency table.

Before calculating the estimate for each class of the word mouse, the weights for each feature need to be assigned. In order to train the MEMM, the training data is first split into subsets. Each of these subsets is a pair <o, qj> composed of the observation o and its destination state qj (e.g. label), each transitioning from a state qi. Once all the training data has been processed into subsets, the Generalized Iterative Scaling (GIS) algorithm [43] is applied to train the probabilistic model associated with each subset. It then outputs the values for each feature weight that maximize each likelihood function. Finally, the feature weights obtained are all merged into one single probabilistic model that is used to calculate the best suited label sequence for each input.
As in the case of HMMs, the Viterbi algorithm is also used to decode the best label sequence, following back-pointers that backtrack over the input sequence to select the best output label sequence.

2.2.3 Conditional Random Fields

Conditional Random Fields (CRFs) [40] constitute an approach similar to Maximum Entropy Markov Models. This approach estimates the most likely hidden sequence with a procedure similar to that of Equations 2.5 and 2.6. However, the difference resides in the normalization factor z. While MEMMs apply the normalization factor to each state, CRFs apply the normalization factor only at the end of the path, preventing the method from preferring states that have fewer arcs leaving them. Equation 2.9 shows the calculation of the best hidden sequence, where the normalization factor is shown in Equation 2.11. Finally, Equation 2.10 is the probability of a word belonging to a class, taking into account its features and corresponding weights, discarding the normalization factor.

T = \arg\max_{T} \left( \frac{1}{z(o, q_i)} \prod_{i} P(tag_i \mid word_i, tag_{i-1}) \right) \quad (2.9)

P(q_j \mid q_i, o) = \exp\left(\sum_{i} w_i f_i(o, q_j)\right) \quad (2.10)

z(o, q_i) = \sum_{C} \prod_{i} P(tag_i \mid word_i, tag_{i-1}) \quad (2.11)

2.3 Evaluation of NLP/IE Components

The assessment of the quality of the obtained results, for most of the Natural Language Processing tasks that were mentioned in Section 2.1, is usually performed in terms of several evaluation measures from the area of Information Retrieval. Precision, recall, the F1-measure and accuracy are common evaluation measures used to compare the output of each task with human-annotated answers. To better understand these metrics, a contingency table is presented in Table 2.1.


Table 2.1 demonstrates the possible assignments for a binary classification task, in which an instance is classified as belonging either to class A or to class B. A good example of binary classification is the one presented in Section 2.1.1, when a period symbol is ambiguous: the goal is to classify the symbol and find out whether it indicates the end of a sentence or not. This is the simplest case and a good one to exemplify the measures that will be presented. In Table 2.1, the columns refer to the true class of the item being classified, while the rows refer to the classifier's answers. Based on this table, the evaluation measures are easily derived. Precision for class A gives us the percentage of items classified as belonging to class A that are correct, as shown in Equation 2.12. Recall for class A (Equation 2.13), on the other hand, gives the percentage of items in class A that were selected. These two measures involve a trade-off, and it is difficult to choose the one that better evaluates a task. The F1-measure is a combination (i.e. the harmonic mean) of precision and recall, attempting to yield a balanced value that favors neither precision nor recall. Finally, accuracy is the total percentage of instances that the system got correct, as shown in Equation 2.15.

Precision = \frac{tp}{tp + fp} \quad (2.12)

Recall = \frac{tp}{tp + fn} \quad (2.13)

F_1\text{-measure} = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (2.14)

Accuracy = \frac{tp + tn}{tp + fp + fn + tn} \quad (2.15)
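As a concrete illustration of Equations 2.12 to 2.15, the following sketch computes the four measures from the raw counts of a binary contingency table; the counts used in the example call are made up for demonstration purposes only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from the cells of a
    binary contingency table (Equations 2.12 to 2.15)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Example with made-up counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
print(binary_metrics(tp=40, fp=10, fn=20, tn=30))
# -> (0.8, 0.666..., 0.727..., 0.7)
```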

Most classification problems, however, usually involve more than two classes. In these multiclass classification tasks, the contingency table is expanded to n × n cells, where n is the number of classes to be classified. Each entry xij contains the number of instances that were classified as class j but belong to class i. A diagonal matrix, or table, reflects an ideal classifier. Micro- and macro-averaged methods can also be used when dealing with multiple class labels. Imagine that we have the true positives (tp), false positives (fp), false negatives (fn), precision (P) and recall (R) results for two sets of data, each representing the statistics of a specific class label. The micro-averaged method calculates precision and recall by pooling the raw counts of both sets, thereby taking into account their different sizes, as shown in Equations 2.16 and 2.17. The macro-averaged evaluation method, on the other hand, is simply a straightforward average of the precision (Equation 2.18) and recall (Equation 2.19) measures of the different sets, in order to capture the overall performance. The F1-measure, in both methods, continues to be calculated as shown in Equation 2.14.

Precision_{micro} = \frac{tp_1 + tp_2}{tp_1 + tp_2 + fp_1 + fp_2} \quad (2.16)

Recall_{micro} = \frac{tp_1 + tp_2}{tp_1 + tp_2 + fn_1 + fn_2} \quad (2.17)

Precision_{macro} = \frac{P_1 + P_2}{2} \quad (2.18)

Recall_{macro} = \frac{R_1 + R_2}{2} \quad (2.19)
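The difference between the two averaging strategies can be made explicit with a small sketch: micro-averaging pools the counts before computing the ratios, while macro-averaging computes per-class precision and recall first and then averages them. The per-class counts below are illustrative only.

```python
def micro_macro(counts):
    """`counts` maps each class label to a (tp, fp, fn) triple.
    Returns ((micro_p, micro_r), (macro_p, macro_r))."""
    # micro: pool the raw counts across classes (Equations 2.16 and 2.17)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)

    # macro: average the per-class measures (Equations 2.18 and 2.19)
    per_p = [c[0] / (c[0] + c[1]) for c in counts.values()]
    per_r = [c[0] / (c[0] + c[2]) for c in counts.values()]
    macro_p, macro_r = sum(per_p) / len(per_p), sum(per_r) / len(per_r)
    return (micro_p, micro_r), (macro_p, macro_r)

# Two classes of very different sizes: the micro scores are dominated by
# the larger class, while the macro scores weight both classes equally.
print(micro_macro({"PER": (90, 10, 10), "LOC": (5, 5, 15)}))
```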


2.4 Summary

This chapter presented the concepts necessary for the reader to fully understand the more complex tasks addressed throughout this thesis. The following list summarizes the key points discussed in this chapter:

• Natural Language Processing is concerned with the handling, or processing, of textual documents. These documents are usually complex, since they contain human language, and, in order to be processed, they go through a pipeline of operations that infer information from them. The information inferred in one step helps the following task in the pipeline. Typically this pipeline is composed of the following tasks: sentence splitting, tokenization, part-of-speech tagging, NP-chunking, named-entity recognition, dependency parsing and relation extraction;

• Many NLP tasks can be formulated as sequence classification problems. This chapter described the typical machine learning algorithms used to label each unit (e.g. token) in a sequence of string units (e.g. a sentence). Commonly used sequence classifiers, often with slight variations according to the task at hand, are Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields;

• Finally, an explanation of how to evaluate each NLP task was given, based on a contingency table, which is the starting point for binary classification. Multiclass classification tasks increase the size of this contingency table, where each cell counts the correctly and incorrectly classified instances of each class.


Chapter 3

Related Work

This chapter presents previous relevant work in the area of Information Extraction (IE). The following sections describe IE techniques used in various settings, including those where information is extracted from literary books, in Section 3.1, and from Portuguese texts, in Section 3.3. In Section 3.1, the methods applied to literary texts focus mainly on extracting interactions between characters, in an attempt to outline the plot. Section 3.2 describes several Open-Domain Information Extraction systems, which are the main subject of this thesis. These systems rely on different techniques to extract relations, resulting in two main branches: either using shallow features or dependency parsing. Finally, some approaches that have been applied to Portuguese texts are described in Section 3.3.

3.1 Information Extraction in Literary Studies

This section presents some publications related to the use of Information Extraction techniques, mostly supervised, on literary texts. Information Extraction on literary texts has focused on defining networks of characters, where an interaction between two entities (e.g., people and locations) is represented visually. Other approaches have dealt with quoted speech and the assignment of the character responsible for each quote, in order to outline the plot in books that consist mostly of quoted speech. Lastly, some publications have presented methods targeting the extraction of specific relationships between characters in text.

3.1.1 Social Network Extraction

Most IE approaches focused on literary texts try to extract interactions between characters and places when gathering information from them. These approaches usually express the gathered information in the form of a social network, where different nodes (i.e. entities) are connected whenever a textual semantic relationship is found between them. This general approach became popular with Franco Moretti and his definition of distant reading, a method that tackles literary problems, namely plot analysis, by scientific means. Moretti also created a Literary Lab1 at Stanford, where a group of people, including Moretti, pursues literary research of a digital and quantitative nature. One of the projects developed was the study of plots in terms of network theory [48].

1http://litlab.stanford.edu/current-projects/


Other studies followed Moretti's plot analysis. For instance, Elson and McKeown (2010) [24] developed an approach that automates the study of a large sample of literary texts, more specifically 60 nineteenth century novels from various categories. Their goal was to verify some theories about the social world of nineteenth century fiction, focusing mainly on the characters' interactions. First, people and organizations were identified using the Stanford NER. A clustering process follows, creating coreferents of the entities found based on a previous method [19], in which certain words of a multiword name are omitted. A speaker is then assigned to each quoted speech in the text, using the method described in more detail in Section 3.1.2.

Even though literary studies center on characters and their direct interactions, there is semantic knowledge outside quoted speech. For example, in the book of Genesis there are 330 distinct person names, but only 53 of those are involved in dialogue interactions. In this situation, the previous method would only be able to capture a small part of the characters' relationships. Lee et al. [41] try to extract social networks from texts that lack dialogue interactions. There are three ways to describe a social relationship:

1. Explicitly, where a relationship is stated (e.g. Noah had three sons: Shem, Ham and Japheth);

2. Implicitly non-verbal, when characters interact non-verbally and are aware of it;

3. Implicitly verbal, for instance through quoted speech and similar communication mechanisms that are not represented as a dialogue.

The proposed approach [41] first uses Stanford NER and Stanford's Deterministic Coreference Resolution to extract entities and to avoid missing relations where a pronoun appears instead of an entity. When a pronoun is found, this step associates it with an entity within n sentences, where the parameter n is tuned on the development data. However, not all sentences where two entities co-occur represent an interaction. This approach handles the issue by leveraging POS and dependency information, as well as by resorting to semantic information from FrameNet2. In order to determine whether two people are involved in a social interaction, the two named entities have to be marked as the subject-object pair of a verb, or as a pair connected by a coordinating conjunction, serving as a subject or object. FrameNet is also useful to check which frame the verb belongs to, as only verbs associated with frames that indicate social interactions are accepted as valid relations. The approach also associates people with locations by identifying every named entity following a mention of a location. As expected, this does not always hold true: experimental results showed that, in a number of cases, people appeared before a mentioned location and were in fact at that location. Evaluation was done using five books of the Hebrew Bible: one as a development set and the remaining four as the test set. Besides inaccuracies related to coreference resolution and problems with the NLP models, no significant problem was reported.

3.1.2 Identification of Speakers

Quoted text is a particularity that some documents present, such as literary texts or news articles. Most work on quoted speech has focused on the news domain, but quotes can come in various syntactic forms. Table 3.2 shows three types of quotes (i.e. utterances) that can appear in literary texts. Approaches for literary texts try to assign a speaker to each quote found in the text. The baseline approach for this task assigns the entity that is closest to the quote. However, correct attribution depends on the syntax and semantics of the scene.

2https://framenet.icsi.berkeley.edu/fndrupal/


Syntactic Category     | Solver            | Feature Vector | Correct %
Quote-Said Person      | Pattern Matching  | n/a            | 0.99
Added Quote            | Pattern Matching  | n/a            | 0.97
Backoff                | Logistic+J48+JRip | f              | 0.64
Quote Alone            | Logistic+J48+JRip | f − f_mean     | 0.63
Apparent Conversation  | JRip              | f − f_min      | 0.93
Anaphora trigram       | Logistic          | f − f_mean     | 0.63
Quote-Person-Said      | JRip              | f              | 0.97
Overall                | –                 | –              | 0.83
Baseline-Nearest       | –                 | –              | 0.52

Table 3.1: Performance on syntactic category prediction in the approach from Elson et al. (2010).

Type of Utterance  | Example
Implicit Speaker   | "Don't keep coughing so, Kitty, for heaven's sake!"
Explicit Speaker   | "I do not cough for my own amusement," replied Kitty.
Anaphoric Speaker  | "Kitty has no discretion in her coughs," said her father.

Table 3.2: Three types of utterances, showing the speaker in bold and character mentions in italic (in the original).

Elson et al. (2010) describe the building of a categorizer [25], trained with a corpus consisting of 3,176 instances of quoted speech, which was able to determine the speaker with an accuracy of 83%, surpassing the performance of the nearest-character baseline method. The developed approach works by first identifying the candidate speakers, i.e. all the named entities preceding each quote in the text. Coreferents are then produced and linked together as the same entity, using a method similar to the one described in Davis et al. (2003). Cleaning and normalization are also performed, removing unessential sentences and words, and finally character mentions and verbal expressions are encoded with symbols. The next step is to classify each quote, through pattern matching, into one of seven syntactic categories. These syntactic categories capture the underlying semantics of some quotes, allowing a better prediction of the speaker. Two of these categories already imply a speaker for the target quote; for example, if a quote is classified as Added Quote, then its speaker will be the same as the previously assigned one. Machine learning techniques are used for the remaining five categories. Three predictive models are built, and feature vectors are extracted from each candidate-quote pair (features shown in Table 3.3). The feature vectors used result from the relative difference to either the average or the minimum value feature vector, f_mean and f_min respectively. Various classifiers were also experimented with, such as J48, JRip and a two-class logistic regression model with a ridge regularization term, as available in the WEKA toolkit [32], with the best results by category shown in Table 3.1.

He et al. (2013) distinguish three types of utterances commonly found in literary texts [35], as shown in Table 3.2. Typically, most utterances fall within the Implicit Speaker category, while the Explicit Speaker category is the least used in modern fiction. Explicit speakers are extracted by focusing on the speech verbs that appear either before or after the quote. The example in Table 3.2 shows the speech verb replied followed by the speaker's name Kitty. In order to locate the speaker of an anaphoric expression, the method proposed by He et al. (2013) uses a dependency parser, which links a speech verb to the noun phrase that is the syntactic subject of a clause. Once the possible speakers have been located, the gender of each one is determined by following some syntactic rules. If there is more than one candidate speaker for a quote, a ranking model is used according to the features mentioned in Table 3.3.


Features                                 | Elson et al. (2010) | He et al. (2013)
Distance from candidate to utterance     | Yes                 | Yes
Speaker appearance count                 | Yes                 | Yes
Speaker name in utterance                | Yes                 | Yes
Unsupervised actor-topic model           | No                  | Yes
Vocative speaker name                    | No                  | Yes
Neighboring utterances                   | No                  | Yes
Gender matching                          | No                  | Yes
Presence matching                        | No                  | Yes
Punctuation between candidate and quote  | Yes                 | N/A
Length of the quote                      | Yes                 | N/A
Quote position in paragraph              | Yes                 | N/A

Table 3.3: Main features used in two different approaches when identifying speakers in novels.

The vocative speaker name feature is used to detect whether a character is mentioned in an utterance. If a vocative is found in an utterance (e.g. Kitty in the anaphoric example from Table 3.2), then it is likely that this vocative is a potential participant in the dialogue. Vocative detection is accomplished with a logistic regression classifier [3], trained with features designed to capture the punctuation context as well as typical phrases that accompany vocatives. For evaluation, chapters of Pride and Prejudice and a corpus featuring utterances from 19th and 20th century English novels were used as the test set. Since the goal is to match utterances to characters, a preprocessing step is performed, in which the list of characters is extracted beforehand using Stanford NER; aliases are then produced from this list of characters. Experimental results were measured by comparing three different models: neighboring, individual, and the baseline. The neighboring model follows the described approach, while the individual model is an attempt to reproduce the approach of Elson et al. (2010). The obtained results show a better performance from the neighboring model, which achieves an accuracy of 79% compared to 73% for the individual model.

3.1.3 Extraction of Family Relations

Some previous approaches have instead focused on extracting specific relations from literary texts. One particular approach [42] tried to use the ReVerb system, an OIE system that will be discussed in Section 3.2, to extract family relations from novels. Despite the many relations obtained, none of them captured family relationships between characters. Therefore, a new approach was created by combining word-level techniques and utterance (i.e. quote) attribution approaches. The proposed approach first employs the method from Elson et al. (2010) to assign speakers to each quote in the narrative, followed by a supervised method [34] to filter the candidate utterances. In this step, a small list containing target nominals that describe a family relation is expanded with WordNet synonyms and hypernyms. After finding the utterances that indeed represent a family relation (i.e., quotes that contain the target nominals), relations are extracted in the form of a standard triple (A1, R, A2), where A1 and A2 denote the arguments, and R the family relation mentioned. The speaker assigned to the utterance becomes A2, while the nominal found becomes the relation R. The first argument of the relation is found by selecting the preceding and following speaker of each utterance and checking whether the gender of each candidate matches the nominal. If both candidates are rejected, the relation is abandoned. Finally, new relations are inferred from the extracted relations – the seed relations – using a rule-based propagation technique. Development and evaluation were done on Jane Austen's Pride and Prejudice, due to its fairly high number of characters and rich set of family relations.


The first 23 chapters were used as the development corpus and the remaining chapters for evaluation. The Columbia Quoted Speech Attribution Corpus (CQSA3) was also used as a test set [25]. The performance of the utterance attribution step was 5% lower than that of the original approach, with an accuracy of 78%. In the vocative detection step, the supervised method using a Naive Bayes classifier achieved an F1-measure of 88%. Finally, and due to propagation errors, the tuple extraction step achieved 77% precision while having a low recall of 27%.

3.2 Open-Domain Information Extraction

This section describes state-of-the-art Open-Domain Information Extraction (OIE) systems that extract semantic relations from English texts. The systems described here follow two major techniques, namely shallow parsing and dependency parsing, and try to achieve a good overall performance, since these two techniques pose a trade-off between accuracy and processing time.

3.2.1 TextRunner

TextRunner [6] pioneered the alternative IE paradigm of open-domain extraction, being the first scalable and domain-independent OIE system. Its architecture consists of three modules, addressing the main challenge that traditional IE systems faced – extensive human involvement. Briefly, the three modules are the following:

Self-supervised Learner: The learner module receives a small set of non-labeled training data and proceeds to label it with a syntactic parser [39]. Each extraction takes the form of a tuple t = (ei, rij, ej), where ei and ej, with i < j, denote the entities or arguments of the relation, and where rij is the textual string that denotes the relationship between these entities. The learner labels an extraction as positive if ei and ej are in accordance with some constraints on the syntactic structure, or negative otherwise. Once a set of tuples has been found and labeled, each tuple is mapped to a domain-independent feature vector representation. The learner then uses these feature vectors as input to a Naive Bayes classifier;

Single-pass Extractor: The extractor module receives a larger corpus and automatically labels each word with its most probable part-of-speech tag. Entities are then found by identifying NP chunks, and the chunker also associates with each word a probability of belonging to that NP chunk/entity. Tuples containing entities with a low probability – a low level of confidence – are subsequently discarded. Finally, each tuple is given as input to the classifier previously trained by the learner module, and is classified as a trustworthy relation or not. If trustworthy, tuples are then stored in a normalized form where non-essential modifiers are omitted;

Redundancy-based Assessor: The assessor module merges tuples where both the entities and the normalized relation are identical, and assigns a probability to each retained tuple based on a probabilistic model, in order to assign high confidence to extractions occurring multiple times.

TextRunner's performance was compared against KnowItAll [26], also an unsupervised IE system capable of performing large-scale extraction from the Web. The test set consisted of 9 million Web pages, from which the task was to extract facts. Since KnowItAll has a more restrictive set of relations, 1,000 sentences of the test corpus containing 10 types of relations were selected in order to perform the evaluation.

3http://www.cs.columbia.edu/nlp/tools.cgi


Figure 3.1: TextRunner's simplified architecture.

In the end, TextRunner proved to be better than KnowItAll, as its average error rate was 33% lower. Furthermore, TextRunner's extraction process took a total of 85 CPU hours, while KnowItAll took an average of 6.3 hours per relation. Although TextRunner achieves a higher performance, it lacks scalable methods when it comes to synonym relations and to resolving multiple names referring to the same entity, as these were the cause of most of the ill-formed tuples. TextRunner also presented another problem, in that it cannot distinguish concrete factual relations (e.g. Tesla invented the coil transformer) from abstract ones (e.g. Einstein derived a theory). Subsequent work produced a new model of TextRunner, which uses a linear-chain CRF instead of a Naive Bayes classifier. This improved system, called O-CRF [7], proved to be better, achieving 88.3% precision and 45.2% recall. O-CRF's training process is self-supervised: it applies some relation-independent heuristics to the Penn Treebank and obtains a set of labeled examples in the form of relational tuples. Like the Naive Bayes TextRunner system, it uses these sets of examples to extract features and then train the CRF model. Given an input, O-CRF also does a single pass over the data, performing POS tagging and NP-chunking. After the extraction, it uses the RESOLVER algorithm [59] to find relation synonyms. RESOLVER is a probabilistic model that uses several relational features in order to check whether two strings refer to similar items. The model's prediction is based on string similarity and distributional similarity. The latter works by checking whether the properties that belong to one string closely match the other string's properties; properties, in this case, are the context words of the sentences in which the strings in question were mentioned.

3.2.2 WOE

WOE [58] is another OIE system, which differs from TextRunner in that it automatically transfers knowledge from the Web, namely from Wikipedia and DBpedia pages, in order to extract relations without limitations. WOE has another particularity: it provides two types of extractors, as described below. Like TextRunner, WOE has three main components, described as follows:

Preprocessor: This component splits the Wikipedia article into sentences using OpenNLP4. Depending on the version of the system selected for training, WOEparse or WOEpos, the preprocessor either uses the Stanford Parser5 to create a dependency parse, or uses OpenNLP to obtain POS tags and NP-chunk annotations, respectively.

4https://opennlp.apache.org/
5http://nlp.stanford.edu/software/lex-parser.shtml


Figure 3.2: WOE's architecture.

Also, the preprocessor uses Wikipedia's redirection pages and backward links to automatically construct synonym sets for each entity;

Matcher: The matcher component iterates through all the infobox attributes and searches for a unique sentence in the Wikipedia article that contains references to both the subject of the article and the attribute value. These will be annotated as the noun phrases, i.e. arg1 and arg2, of a triple (arg1, rel, arg2), where rel is a textual fragment that establishes an explicit semantic relation between the two noun phrases. Since DBpedia has a cleaner set of infoboxes, from 1,027,744 Wikipedia articles, the matcher uses DBpedia data for extracting the attribute values;

Learning Extractors: WOE has two versions of extractors: WOEparse and WOEpos. WOEparse uses the Stanford Parser to create dependencies in the collapsed-dependencies format. This format collapses information that is not useful for relation extraction, leaving direct dependencies between content words. After having selected and annotated the sentences, the learner generates the shortest connecting path between the subject and the attribute value – called the corePath. Afterwards, the learner creates a smaller set containing the corePaths expressed through their POS tags – called generalized corePaths – and these constitute the final extraction patterns. WOEpos is the other available extractor, a CRF extractor based on shallow features such as POS tags. It also generates training data by matching Wikipedia sentences with infoboxes, but here each matching sentence provides the positive examples, while the negative examples are generated from noun phrase pairs in the other, unmatched sentences. WOEpos uses the same learning algorithm and selection of features as TextRunner: a second-order CRF chain model trained with the Mallet package.

The developers compared the performance of TextRunner, WOEparse and WOEpos. The test set consisted of 300 random sentences extracted from three corpora: the WSJ portion of the Penn Treebank, Wikipedia and the general Web. Overall, WOEpos was shown to perform better than TextRunner. Since WOEpos uses an extractor similar to TextRunner's (i.e. the same learning algorithm and features), it can be asserted that the improvement in performance is due to better training data (i.e. from Wikipedia via self-supervision). WOEparse is the extractor that achieves the highest performance on all three datasets. Parser features are more helpful when handling long and difficult sentences, although they come at a cost: while WOEparse performs better, it takes more time to process a single sentence than the other systems.

3.2.3 ReVerb

The first generation of OIE systems could not fully capture some relation phrases, producing incoherent and uninformative extractions.


Uninformative extractions happen when facing relation phrases that are expressed by a combination of a verb with a noun, with the noun carrying the semantic content of the predicate (e.g. (Faust; made; a deal) instead of (Faust; made a deal; with the devil)), while incoherent relations surface because some words are left out of the relation phrase, leaving it incomprehensible. ReVerb [27] is an OIE system that eliminates these two issues by introducing syntactic and lexical constraints. The syntactic constraint serves mainly to eliminate incoherent extractions and to reduce uninformative ones. It imposes that relation phrases should match a defined sequence of POS tags, although this pattern can sometimes create overly specific relation phrases. Therefore, the lexical constraint was introduced, stating that a valid relation phrase should take many distinct arguments in a large corpus, thus avoiding relations with only a few possible instances. ReVerb also enforces a common constraint, used in the previously mentioned OIE systems: the relation phrase must appear between its two arguments. Given a sentence as input, ReVerb first uses OpenNLP for POS tagging and NP-chunking. Then, for each verb in the sentence, it tries to find the main verb and consequently the relation phrase of the sentence by checking the following three conditions:

1. The relation phrase starts with that particular verb;

2. Satisfies the syntactic constraint (implemented as the regular expression in Figure 3.3);

3. Satisfies the lexical constraint.

The lexical constraint is ensured by a dictionary previously built from 500 million Web sentences, containing the relation phrases that agree with the syntactic constraint. These relation phrases are normalized by removing unessential words, and their arguments are identified. The dictionary retains those relation phrases that are found with at least 20 distinct arguments. After each relation phrase is found, by agreeing with all conditions, ReVerb finds the nearest noun phrases to the left and right of the relation phrase. Finally, ReVerb uses a logistic regression classifier to assign a confidence score to each extracted triple, based on features and their corresponding weights, learned from manually labeled extractions from the Web and Wikipedia. In the end, the developers showed that ReVerb's performance was considerably higher than that of the previously mentioned systems. Each system was supplied with 500 sentences sampled from the Web using Yahoo's random link service6. Results were then compared with human-annotated binary extractions, and they showed that ReVerb's precision is much higher than that of the other systems at nearly all levels of recall, achieving an area under the precision-recall curve 30% higher than WOEparse and more than double that of TextRunner and WOEpos. ReVerb also proves to be a good source of training data: another experiment compared TextRunner-R, a version of TextRunner trained with positive and negative data classified by ReVerb, against the original TextRunner, which uses the Penn Treebank dataset for training its CRF. TextRunner-R achieves an area under the precision-recall curve 71% higher than TextRunner, reaching the same performance as WOEpos, although with much lower precision than ReVerb. A comparison between ReVerb and ReVerb¬lex – the ReVerb system without the lexical constraint – showed that the biggest improvement came from the lexical constraint, as it was able to reduce the number of uninformative and over-specified relations.
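To make the syntactic constraint concrete, the sketch below checks a candidate relation phrase against a regular expression over its POS tags, in the spirit of the pattern shown in Figure 3.3. The tag names (here from the Universal POS Tagset, with ADP for prepositions and PRT for particles) and the helper function are illustrative assumptions, not ReVerb's actual implementation.

```python
import re

# POS-tag pattern roughly following Figure 3.3: a verb, optionally followed
# by a particle/adverb, then nouns/adjectives/etc. and a final preposition.
V = r"(VERB)( PRT)?( ADV)?"
W = r"( (NOUN|ADJ|ADV|PRON|DET))"
P = r"( (ADP|PRT))"
RELATION_PATTERN = re.compile(rf"^({V}|{V}{P}|{V}{W}*{P})$")

def satisfies_syntactic_constraint(pos_tags):
    """Check whether a sequence of POS tags (e.g. of a candidate relation
    phrase) matches the verb-based pattern."""
    return bool(RELATION_PATTERN.match(" ".join(pos_tags)))

# "made a deal with" -> VERB DET NOUN ADP: accepted
print(satisfies_syntactic_constraint(["VERB", "DET", "NOUN", "ADP"]))  # True
# "a deal" -> DET NOUN: rejected, since no verb starts the phrase
print(satisfies_syntactic_constraint(["DET", "NOUN"]))                 # False
```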

3.2.4 Open Language Learning for Information Extraction

ReVerb and WOE share two important weaknesses:

6http://random.yahoo.com/bin/ryl


V | VP | VW*P
V = verb particle? adv?
W = (noun | adj | adv | pron | det)
P = (prep | particle | inf. marker)

Figure 3.3: ReVerb’s syntactic constraint, i.e. part-of-speech-based regular expression.

1. They only extract relations mediated by verbs;

2. Both ignore context, thus often extracting tuples that are not asserted as factual, as they perform a local analysis of the sentence.

OLLIE [56], Open Language Learning for Information Extraction, is an OIE system that overcomes the aforementioned limitations by expanding the syntactic scope of the relation phrases to cover a much larger number of relation expressions. It also expands the OIE representation to allow additional context information, such as attribution and clausal modifiers. Since ReVerb's verb-based expressions are capable of covering a broad range of relations, OLLIE uses high confidence seed tuples generated by ReVerb. For each seed tuple, all sentences that contain its content words are retrieved. A sentence matches if the arguments from the tuple match the sentence's arguments and if the relation is the same or a similar variation. To avoid bootstrapping errors, the authors also check whether the arguments and the relation can be linked to each other via a linear path of size four in the dependency parse, using the Malt Dependency Parser for the dependency parsing and Stanford's collapsed dependencies to compact the parse structure. OLLIE's second step is to learn the patterns that encode the various ways of expressing relations, called open pattern templates. To learn the patterns, the dependency path connecting both arguments and the relation is extracted, resulting in a tuple. The tuple is then created by normalizing the auxiliary verb to be and by replacing the relation content word with rel. If a node of the dependency parse tree does not belong to the tuple, it is called a slot node. Slot nodes can be ignored if they do not negate the tuple. Each candidate pattern goes through a series of conditions that determine whether it is a syntactic or a semantic/lexical pattern. Semantic patterns are not as general as syntactic ones, given that they usually contain words that put constraints on the relation, therefore needing further treatment in order to become open pattern templates. To enable these kinds of patterns, the lexical constraint is retained and a list of words is associated with it, composed of words previously seen in the same pattern. The resulting pattern templates are then used to match the dependency parses of input sentences, where the arguments and relations are later identified. Finally, OLLIE tries to solve the problem of non-factual relations being extracted by adding extra fields, such as the AttributedTo and ClausalModifier fields. The extraction of these additional fields is accomplished by finding a ccomp (clausal complement) or an advcl (adverbial clause) edge to the relation node in the dependency parse structure and matching: (1) the context verb of the ccomp clause against a list of communication and cognition verbs from VerbNet (for AttributedTo field detection), and (2) the first word of the advcl clause against a list containing words such as 'if', 'when', 'although' and 'because'. Since the extra fields cannot cover all cases of non-factual extractions, OLLIE uses a supervised logistic regression classifier, relying on features such as the frequency of the extraction pattern and the presence of the extra fields, to attach a confidence level to each extraction. It can therefore avoid non-factual extractions by ignoring tuples with low confidence. Finally, the analysis of OLLIE's performance involved a dataset of 300 random sentences from News, Wikipedia and a Biology textbook. OLLIE, ReVerb and WOEparse were executed, resulting in 1,945 extractions altogether. Overall, OLLIE has a larger area under the precision-yield curve, about 2.7 times larger than ReVerb and 1.9 times larger than WOEparse. It misses very few extractions, and this happens mostly due to parse errors.


Figure 3.4: OLLIE's simplified architecture.

Another experiment showed that the syntactic and semantic restrictions further help OLLIE's precision, by comparing three systems: the full OLLIE system, OLLIE without semantic or lexical restrictions, and OLLIE with lexical restrictions but without any type of generalization. The results show that the lexical/semantic restrictions greatly improve OLLIE's performance.

3.2.5 SRL-IE

In contrast to the previous techniques, other studies [14, 15] have used semantic roles for the task of OIE. The Propbank and FrameNet resources have seen significant progress over the last few years, allowing better results on the semantic role labeling task. SRL-IE [14] is a Semantic Role Labeling (SRL) based extractor of relation tuples. In brief, SRL is an NLP task concerned with detecting the semantic arguments associated with a verb, assigning them different semantic roles. Propbank allows the annotation of these roles over arguments, enabling significant progress in SRL systems over the past few years. SRL-IE uses UIUC-SRL [54] as the base system for performing SRL, as it achieved the best F1 score on the CoNLL-2005 shared task. UIUC-SRL performs the basic operations of the SRL task but, in order to extract relation tuples, its output is converted into a format closer to that of OIE systems. The verb, along with its modifiers and negation, is considered the relation, while the arguments are all the entities found in the text. SRL-IE also limits the extracted relations by considering only those that have at least two arguments. SRL-IE and TextRunner were run on a test corpus of 29,842 sentences from various Web documents. SRL-IE achieved much higher recall and precision than TextRunner, although at the cost of a much larger processing time: while TextRunner took 6.3 minutes, SRL-IE took 52.1 hours. This huge difference is due to the use of semantic features (i.e., the semantic roles) instead of TextRunner's shallow features. Although SRL-IE has a much higher precision, TextRunner is able to surpass SRL-IE's performance in some situations, such as on highly redundant or high locality (where arguments are closer to each other in a sentence) extractions. Two hybrid systems were also developed, taking advantage of each system's strengths: RecallHybrid and PrecHybrid. RecallHybrid runs TextRunner on all sentences, leaving some extra time for SRL-IE to run over a random subset of sentences; this hybrid achieves the best recall, as it does not lose any extractions. PrecHybrid avoids tuples with low confidence, low redundancy and with arguments far apart, and ranks the sentences for extraction, expecting to yield the maximum amount of new information. Both hybrid systems outperform TextRunner on the F1-measure and take less processing time than SRL-IE.


3.3 Open-Domain Information Relation Extraction in Portuguese

Only a few OIE systems have specifically dealt with Portuguese texts. One of the approaches found in the literature is a multilingual OIE system [30] that overcomes some of the challenges faced while doing unsupervised extraction, such as extracting non-verbal mediated relations and events with more than two arguments. The proposed approach consists of a three-step pipeline. First, the authors use TreeTagger7 and DepPattern, in order to handle non-English languages, for POS tagging and syntactic parsing of the text, outputting the dependencies. The results are then transformed into a partial constituency tree, where only the constituents of the clause are selected – nouns, verbs and prepositional phrases. Finally, a small set of extraction rules is applied to the clauses obtained from the previous step, in order to extract triples. These pattern-based rules only consider verb-based triples and extract only one triple per sentence, similarly to ReVerb's approach. Evaluation was done with Wikipedia sentences extracted in four languages: Portuguese, Spanish, Galician and English. Due to DepPattern's grammars not being complete, the number of extractions was in fact lower than ReVerb's, but, of all four parsers, the Portuguese and Galician ones showed the best performance, at 70% in terms of the F1-score.

Another approach [8] proposed to find the most similar relations from a set of previously annotated ones. The procedure consists of a classifier based on nearest neighbor classification (kNN), where each training example has an associated weight measuring its similarity to the instance being classified. A relation here is essentially represented as a quadgram. This representation, for each binary relation, considers:

1. The substring that occurs between the two concepts that constitute the binary relation;

2. The words that occur before the first concept and between the two concepts, up to a maximum of three tokens;

3. The words between and after the second concept, again considering a maximum window of three tokens.

This representation [47] follows the observation that a relation is usually expressed using words that appear in one of three basic patterns: before-between, between, and between-after. To create training examples for the classifier, DBpedia relations between concepts are extracted. The Wikipedia texts of both concepts are then analyzed and segmented into sentences. To obtain only the sentences where both entities co-occur, variations of the names are generated by resorting to DBpedia and Wikipedia data. The selected sentences are then kept as relation examples and generalized into a common semantic concept. The search for similar relations is accomplished by using an approximation of the Jaccard similarity coefficient, computed through a min-wise hashing procedure [11]. The most similar examples then help the kNN classifier to assign a semantic relation/concept to the target instance. Finally, three different document collections were used to perform the evaluation: a SemEval dataset, a Wikipedia dataset and the AImed corpus. Results show that combining quadgrams of characters, verbs, prepositions and relation patterns leads to a better performance. The results also suggest that 5 to 7 neighbors are ideal for assigning the final result. It is also observable that some relations are easier to find, possibly due to Wikipedia's texts evidencing some frequent patterns.
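As an illustration of the min-wise hashing idea referred to above, the sketch below estimates the Jaccard similarity between two token sets by comparing their minimum hash values under several seeded hash functions; the token sets and the number of hash functions are arbitrary choices for the example, not the configuration used in the cited work.

```python
def minhash_signature(tokens, seeds):
    """One minimum hash value per seed; the fraction of equal positions
    between two signatures estimates the Jaccard similarity."""
    return [min(hash((seed, t)) for t in tokens) for seed in seeds]

def estimated_jaccard(a, b, num_hashes=200):
    seeds = range(num_hashes)
    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes

context_a = {"nasceu", "em", "Lisboa", "1900"}
context_b = {"nasceu", "em", "Lisboa"}
print(estimated_jaccard(context_a, context_b))  # close to the true Jaccard of 0.75
```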

7http://www.cis.uni-muenchen.de/∼schmid/tools/TreeTagger/


3.4 Summary

This chapter presented previous work related to Information Extraction in literary texts. Open-domain extraction techniques were also presented, by describing the procedures that current OIE systems implement. Most IE techniques focus on English texts, and very few IE studies have reported results for the Portuguese language. Overall, the chapter's summary is presented in the following key points:

• Information Extraction in literature focuses mainly on the characters, and therefore the relations between entities are valuable, since these relations can identify interactions and relationships between the entities. Some approaches gather these relations by identifying characters that occur close together, finally presenting them in a network form;

• Other approaches related to Information Extraction in literature focus on extracting interactions and relationship status between characters in quoted speech. The conditions under which the relations appear in quotes are captured, so that classifiers and/or extraction rules can be built to extract these relation instances;

• Since defining the plot and the characters' involvement seem to be the core issues of most IE papers focused on literary texts, some might argue that domain-specific extraction techniques are of better use. One approach focused on extracting family relations first tried to use an OIE system, ReVerb, to achieve this goal. Although many relation tuples were obtained, none of them stated relationships between characters, and a new technique was therefore built, able to extract family relations between characters in quoted speech;

• After assessing all of these previous IE techniques applied to literary texts, it can be concluded that most use supervised methods and other techniques developed specifically for the task at hand. Most of the techniques presented rely on human-created rules and annotations. However, one previous approach had tried and failed to use an OIE system to extract family relationships between characters;

• Open-Domain Information Extraction is the new IE paradigm in focus in this work. Two major branches of approaches outline the procedures described in current OIE systems. TextRunner, WOEpos and ReVerb are shallow parsing systems, as they rely on features such as POS tags and NP-chunks to perform the extractions. WOEparse and OLLIE rely more on dependency parsing features, therefore belonging to the dependency parsing branch;

• Finally, the chapter described two previous systems that addressed relation extraction on Portuguese texts. One approach focuses mainly on extraction rules, while the other uses a classifier to obtain new relation instances. Fewer techniques focus on information extraction from Portuguese documents, and none were found focusing on either OIE systems or Information Extraction techniques applied to Portuguese literary texts.


Chapter 4

Resolving Named Entities and Relations in Literary Texts

This chapter details how the two OIE systems in focus, namely ReVerb and OLLIE, were adapted to the Portuguese language, along with the development and incorporation of task-specific NLP models in these systems. Section 4.1 describes the NLP-specific tasks and details the development of the Portuguese models. Section 4.2 details the adaptation of the two OIE systems so that they can process and return extractions from Portuguese texts.

4.1 POS Tagging, Entity Recognition and Parsing Portuguese Texts

In order to effectively use and evaluate the two OIE systems, Portuguese models were required to perform the NLP tasks that are the stepping stones for producing the desired results. These were mainly POS tagging and dependency parsing models. A POS tagger, one of the simplest tasks in the NLP pipeline, returns the syntactic category of each word in a sentence. It is also with the help of a POS tagger that a dependency parser is able to return its output, namely the dependencies between the words of a sentence. A NER model was also developed in order to process the sentences from literary texts, highlighting the most important sentences, as these are the ones that potentially contain good information to be extracted. The following sections detail the development of each NLP model.

4.1.1 Sequence Tagging Models

Most systems relying on the output of NLP tasks start by receiving plain text and then process it through a pipeline of operations. These operations return valuable information that is later accessed in order to infer further information. As explained in Section 2.1, some of these operations take the input document, represented as plain text, and divide it into sentences and words. Section 2.2 also explains that these sentences and tokens are the basic units needed for some NLP operations, such as POS tagging, NER and dependency parsing. These tasks can also be referred to as sequence tagging classifiers, as they receive an input, usually a string of tokenized words representing a sentence, and output the sequence of labels that best suits each string unit, based on a statistical model.


The way these models were built and used is further discussed in the following sections, each focusing on one NLP task.

4.1.2 POS Tagging

Part-of-speech tagging is one of the simplest NLP operations and its output is essential to other NLP tasks, for example NP-chunking or dependency parsing. A POS tagger usually receives one or more sentences as input, already split and tokenized, and returns the sequence of tags that best suits each token (e.g. word) of each sentence.

The tagset used in the development of this model belongs to the Universal Part-of-speech Tagset1, which consists of twelve syntactic POS categories. Due to a recent growing interest in building and evaluating multilingual systems, this tagset was created by extracting similar coarse-grained POS tags across 25 languages [53]. Since four different datasets were used to develop the POS tagger model, this smaller tagset was used in order to normalize the set of tags of each dataset. Furthermore, adapting a previous system focused on English texts required the conversion of tag-specific code, which is more easily achieved if the tags of both languages have a match; this is easily accomplished with the Universal Part-of-speech Tagset.

As mentioned in Section 2.1, the attribution of the most suitable tags to a specific sentence is done according to a probabilistic model. This model is the result of a training process involving gold answers that come from annotated datasets. The POS tagger training data came from the merge of four datasets, namely CINTIL, Floresta Sintatica, Tycho Brahe and a text made available by the Universal Dependencies project contributors. These many datasets were required not only due to the different nature of each dataset, but also due to the vast universe of content words that can appear in each literary text. Possessing a larger training text file will hopefully reduce the number of unknown words, and therefore avoid errors during the tagging process. Although in theory this might be true, merging different datasets together might also damage the tagger's performance. Preprocessing steps were performed before the training process, in order to normalize all datasets and reduce merging errors. More details about this preprocessing step are given in Section 5.2.1.

Stanford's POS Tagger framework was used to train and evaluate the POS tagging model. In brief, its structure is similar to a bidirectional dependency network. Whereas HMMs and MEMMs are unidirectional approaches (they only use the previous tags to estimate the current tag), bidirectional approaches use both the next and the previous tag. This works by making an initial estimation using observations of the local environment, then proceeding with the referred bidirectional dependency network. Finally, a variation of the Viterbi algorithm is used to identify the maximizing sequence [57].

The training process was done using the four datasets mentioned previously, after all of them went through a preprocessing step involving their POS tags and some dataset annotation details. The features used in the training process involved not only basic tagger features, such as the suffixes and prefixes of words, but also non-annotated resources. These additional resources were composed of all four datasets in their textual form, used to build word clusters. Clusters were built using an open-source implementation2 of the Brown clustering procedure and tuned according to the results obtained on different NLP tasks [22]. Brown clustering is a hierarchical process that groups words into clusters based on the context in which they occur. It creates a function C that maps the vocabulary words V of a training corpus into k clusters.

C : V → {1, 2, ..., k}

1https://github.com/slavpetrov/universal-pos-tags
2https://github.com/percyliang/brown-cluster


Quality(C) = \frac{1}{n} \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i)) \quad (4.1)

One cluster can contain several words, all of which are ideally semantically similar to each other. A greedy clustering process was proposed [12] in order to find an optimum value for k, maximizing Equation 4.1. In this process, each vocabulary word is assigned to its own cluster, creating as many clusters as vocabulary words. Pairs of clusters are then iteratively merged, while maximizing the value of Quality(C). Another clustering process was proposed in which the number of clusters is restricted at the start through a parameter m, and the most frequent words from the training corpus are individually divided into these m clusters.

An iterative merging process follows, in which a new cluster is added to the previous m clusters and, again, a pair is merged while maximizing the value of Quality(C). Equation 4.1 returns a value measuring how well a function C fits the training corpus, by computing the transition probability P(c|c') of a cluster c given its predecessor cluster c', and the emission probability P(w|c) of a word w given its cluster c. The created models were then evaluated using a 5-fold cross-validation technique, where results were compared using different sets of features. Results and their analysis are shown in Section 5.2.1. Lastly, the need to build a POS tagger stems from the fact that ReVerb is essentially based on a heavily POS-tag-dependent rule. Apart from adapting ReVerb's most important extraction rule, the system needs to be supplied with a POS tagger that is able to accept Portuguese texts and effectively process them. Moreover, OLLIE requires a dependency parser, as it relies on word dependencies to extract relation phrases. A dependency parsing model typically requires previously POS-tagged input, therefore also enforcing the need for a POS tagger.
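As an illustration of Equation 4.1, the sketch below evaluates the quality of a given cluster assignment over a toy corpus; the corpus, the cluster mapping and the maximum-likelihood probability estimates are simplifications made for the example, not the setup used to train the actual models.

```python
from collections import Counter
from math import log

def cluster_quality(corpus, C):
    """Average log-likelihood of the corpus under a class-based bigram
    model defined by the cluster mapping C (Equation 4.1)."""
    clusters = [C[w] for w in corpus]
    word_count = Counter(corpus)
    cluster_count = Counter(clusters)
    transitions = Counter(zip(clusters, clusters[1:]))

    total = 0.0
    for i in range(1, len(corpus)):
        w, c, prev_c = corpus[i], clusters[i], clusters[i - 1]
        p_transition = transitions[(prev_c, c)] / cluster_count[prev_c]
        p_emission = word_count[w] / cluster_count[c]
        total += log(p_transition * p_emission)
    return total / len(corpus)

# Toy corpus and cluster mapping: determiners, nouns and verbs are grouped.
corpus = ["o", "gato", "come", "o", "peixe"]
C = {"o": 1, "gato": 2, "peixe": 2, "come": 3}
print(cluster_quality(corpus, C))
```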

4.1.3 Entity Recognition

Entity Recognition is a basic NLP operation whose goal is to return all the named entities (e.g. locations, people, organizations) present in a text. A NER system accepts as input one or more sentences and returns its words labeled with the most suitable category, according to a probabilistic model. The corpus used to train the NER model was the CINTIL corpus. This corpus considers four types of entities/categories: three are the standard Person, Location and Organization categories, followed by a fourth Miscellaneous category. The corpus uses the SBIEO encoding in order to establish the boundaries of each entity, as it is generally the encoding that gives the best results [55]. Stanford's NER framework was the chosen tool to train and evaluate the models. It uses a linear-chain Conditional Random Field sequence model, coupled with feature extractors. The testing process compares the results obtained with various features, as can be seen in Section 5.1. Besides the basic features made available by Stanford's NER framework, the use of resources based on non-annotated text was also considered. These additional features are as follows (more details about the features can be seen in Section 5.2.1):

• Word Clustering: Clusters were built using the same open-source implementation of the Brown clustering procedure referred to in Section 4.1.2. The texts used to infer the word clusters consisted of the merge of the CINTIL corpus (untagged) with a document containing news published over the course of 10 years in the Publico newspaper. The clusters were tuned according to the results obtained after trying out different parameters on NLP-specific tasks [22];

• Stems: The use of the stem form of each word present in the corpus, calculated using the SNOWBALL Stemmer3, a tool that applies the Porter algorithm to obtain the stem of a word;

• Gazetteers and name lists: Two name lists were used, containing common female and male names taken from Wikipedia lists. Gazetteers were also used, containing names of people, locations and organizations. These were built by extracting both real and fictional (in the literary sense) entities from Freebase;

• Upper-cased: A new source of information, calculated over all the content words of the CINTIL corpus. This data consists of a value indicating the fraction of times a word was seen with its first letter upper-cased, being 1 if this holds for all occurrences in the corpus and 0 otherwise.

This task was essential to extract the named entities present in literary texts, in order to identify and supply the most important sentences to the adapted Open Information Extraction systems. The goal of this thesis is to assess whether the OIE approach is suited to extracting information from literary texts, evaluating not only its accuracy but also whether it is a suitable approach for extracting content information, helping, for example, in the population of knowledge bases. Separating the sentences containing entities from the less important ones facilitates the evaluation process, presented in Section 5.2.1.
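To illustrate the SBIEO encoding mentioned above, the sketch below converts entity spans into per-token labels, where S marks a single-token entity, B/I/E mark the beginning, inside and end of a multi-token entity, and O marks tokens outside any entity; the sentence and spans are invented for the example.

```python
def sbieo_encode(tokens, spans):
    """`spans` is a list of (start, end, type) entity spans, with `end`
    exclusive. Returns one SBIEO label per token."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        if end - start == 1:
            labels[start] = f"S-{entity_type}"
        else:
            labels[start] = f"B-{entity_type}"
            labels[end - 1] = f"E-{entity_type}"
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{entity_type}"
    return labels

tokens = ["O", "Vasco", "da", "Gama", "partiu", "de", "Lisboa", "."]
spans = [(1, 4, "PER"), (6, 7, "LOC")]
print(list(zip(tokens, sbieo_encode(tokens, spans))))
# Vasco/B-PER da/I-PER Gama/E-PER ... Lisboa/S-LOC
```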

4.1.4 NP-Chunking

NP-chunking is a partial parsing operation in which only the NP chunks are required. These chunks consist mainly of nouns and their modifiers, therefore expressing a noun phrase. Chunking often requires the help of POS tags in order to improve the quality of the returned chunks.
Building the chunker model involved two frameworks, which resulted in two final models. Stanford's Parser framework was used to develop the main chunker model, and another model had initially been built with OpenNLP's framework. The latter derived from the assumption that Stanford's Parser was unable to return NP-chunks directly: Stanford's Parser builds models that only return constituent parses and/or dependency parses, hence the initial decision to build a model with OpenNLP's framework, which returns chunks directly. However, both models ended up being used for NP-chunking. When Stanford's Parser model returns a constituent parse tree for an input sentence, regular expressions are used to obtain the chunks closest to each leaf node (i.e. word).
Three datasets possessed manually annotated syntactic tags: CINTIL, Floresta Sintactica and Tycho Brahe. The final dataset used for OpenNLP's training process was composed of these three datasets merged together. Since all of them contain constituency parsing annotations, a preprocessing step extracted, for each leaf node (i.e. word), the constituent tag immediately above it in the tree. OpenNLP's training file format had to resemble a POS-tagged file, where words, tags and chunks appear in tab-separated columns. From this point, the training datasets (in the new format) were normalized by altering their constituency tags to broader, more basic tags, followed by restructuring the annotation details to match the tokenization decisions of the previous models. At last, BIO encoding was added to the chunks in the training file.
In order to build Stanford's Parser model, the constituency tags were also transformed into broader tags although, after experimenting with each dataset, only one was chosen to train the final model: the Tycho Brahe dataset. Tycho revealed a more consistent sentence structure and was able to detect more NP chunks than the other datasets. However, this model was trained on the constituency parsing examples directly and returns the whole tree when given a sentence. Since only the NP-chunks were required, only the smallest NP-chunks were extracted from the output tree during the NP-chunking task.

3http://snowball.tartarus.org/algorithms/portuguese/stemmer.html


Finally, both models were kept for ReVerb's NP-chunking task. OpenNLP's chunker model processes sentences much faster, and was therefore used for heavy-computation tasks only, such as building ReVerb's dictionary. However, Stanford's frameworks are known for achieving higher accuracy and the Stanford model was, therefore, the main model used. A sketch of the step that extracts the smallest NP chunks from a constituency tree is given below.
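As a purely illustrative sketch of that chunk-extraction step (the bracketed tree and NLTK's Tree class are used here only for convenience; the actual implementation relied on regular expressions over Stanford's output, so this is a functionally similar sketch rather than the exact procedure), the smallest NP chunks can be read off a constituency tree as follows:

    from nltk import Tree

    def smallest_np_chunks(bracketed_parse):
        """Return the word sequences of the lowest (smallest) NP constituents."""
        tree = Tree.fromstring(bracketed_parse)
        chunks = []
        for subtree in tree.subtrees(lambda t: t.label() == "NP"):
            # Keep only NPs that do not contain another NP inside them.
            inner_nps = [t for t in subtree.subtrees()
                         if t.label() == "NP" and t is not subtree]
            if not inner_nps:
                chunks.append(subtree.leaves())
        return chunks

    # Toy example using broad (generic) constituent tags, as used after normalization.
    parse = "(S (NP (DET O) (N jantar)) (VP (V terminou) (PP (P em) (NP (N Lisboa)))))"
    print(smallest_np_chunks(parse))  # [['O', 'jantar'], ['Lisboa']]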

4.1.5 Dependency Parsing

Dependency parsing is an NLP operation which returns the relations between words when given a sentence as input. These relations, also called dependencies, define an association between two words that states a grammatical function. A dependency parsing model, when given sentences as input, returns a dependency structure for each sentence. Each structure contains a root word, usually a verb, followed by dependencies to the rest of the words which, in a visual format, resemble a nested tree. This task is usually performed last, as it requires the output of the POS tagger to help define the lexical category (i.e. tag) of each word. Knowing the lexical category of each word beforehand improves the results of the dependency parser, since these associations state an underlying grammatical connection between two words.
The label set used for the dependencies in this phase of the work belongs to the Universal Dependencies4. The Universal Dependencies consist of a harmonized set of dependency relations that exist across multiple languages, therefore facilitating the development of multilingual parsing tools. This universal set of dependencies is now provided for over 40 languages, and its annotation scheme is based on the Stanford basic dependencies [45].
Once again, the selected approach to develop this task model was provided by Stanford's Parser framework5. This framework provides two ways of constructing models able to parse Portuguese text and return its dependencies. The first approach [21] consists in obtaining the dependencies from a constituency model, using constituency parsing annotated examples as training resources. Dependency parse and constituent parse structures represent a sentence in different ways: dependencies are made between single words, while constituent parses often nest multi-word constituents. However, both contain similar information. Generating dependencies from constituency parse structures is essentially done by applying rules (or patterns). Given the phrase structure of a sentence, the dependency extraction method starts by assigning head words to every constituent node of the tree. These head words are usually the semantic words of the current constituent, leaving the other words as modifiers (i.e. dependents) of the head words. Next, the grammatical relations are defined between the head words and the remaining non-content words. These are assigned according to patterns: each pattern is matched against every tree node of the constituency parse, outputting the grammatical relations that match between the nodes.
The second, and chosen, approach relies on CONLL-formatted training files, in which the model is trained on direct dependency examples. This approach [13] uses a transition-based dependency parser along with a neural network classifier to make the parsing decisions. The transition-based parser uses a greedy algorithm, aiming to predict a final sequence of transitions (i.e. arcs) that lead to a dependency tree. It employs the arc-standard system, which consists of a configuration c = (s, b, A), where s is the stack, b the buffer, and A the set of dependency arcs. When the parser starts, b contains all the words of the sentence, s contains the single node [ROOT] and A is empty. The arc-standard system performs three types of transitions: (1) LEFT-ARC adds an arc word1 → word2 to A and removes word2 from the stack; (2) RIGHT-ARC adds an arc word2 → word1 to A and removes word1 from the stack; (3) SHIFT moves the top word of the buffer onto the stack.

4http://universaldependencies.org
5http://nlp.stanford.edu/software/lex-parser.shtml


The parser shifts words from the buffer to the stack, while the neural network predicts the correct transition based on a set of words, Sw, taken from the stack and the buffer. Two more sets are also provided to the neural network, containing the corresponding POS tags, St, and arc labels, Sl, of that set of words. The neural network is then built with the corresponding Sw, St and Sl embeddings, which are added to the input layer. The input layer is then mapped to a hidden layer through a cube activation function, ending in a softmax layer on top of the hidden layer for modelling multi-class probabilities. The transition-based dependency parser stops when the buffer b is empty and the stack s contains only the single node [ROOT], as it did initially. A small sketch of the arc-standard transitions is given below.
Training the dependency parser model involved two datasets in the CONLL format. Both datasets were made available by the Universal Dependencies project contributors; however, they contained different versions of the label set. A preprocessing step was therefore needed in order to bring both datasets to the most recent version. Other preprocessing steps were also performed, focusing again on the normalization of POS tags and the restructuring of the annotation, maintaining tokenization consistency with the previously developed models.
The training process also required word embeddings in order to improve the performance of the parser. Word embeddings were created with the word2vec6 tool, using all the available datasets. This tool creates and maps d-dimensional vectors to each word of the vocabulary. The creation of these vectors is affected by the neighboring words within a fixed-size window around the word being mapped. Similar vectors represent semantic proximity, meaning that words with similar vectors can often replace one another in a sentence without altering its validity. More details about the preprocessing steps and the results of the parser can be seen in Section 5.2.1.
Finally, the need for a dependency parser came with the adaptation of OLLIE. This OIE system relies heavily on dependencies in order to extract relation tuples. Dependency parsing is a powerful tool that can provide connections between long-distance words in a sentence, therefore surpassing one of ReVerb's two limitations.
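To make the transition system concrete, the following is a minimal, self-contained sketch of the arc-standard mechanics described above, with unlabeled arcs and with a hand-scripted sequence of actions standing in for the neural network classifier; it is a sketch of the algorithm, not of the Stanford implementation.

    def parse_arc_standard(words, choose_transition):
        """Run the arc-standard system; choose_transition(stack, buffer) plays the
        role of the classifier and returns 'LEFT', 'RIGHT' or 'SHIFT'."""
        stack, buffer, arcs = ["ROOT"], list(words), []
        while buffer or len(stack) > 1:
            action = choose_transition(stack, buffer)
            if action == "SHIFT" and buffer:
                stack.append(buffer.pop(0))          # move next buffer word to stack
            elif action == "LEFT" and len(stack) > 2:
                arcs.append((stack[-1], stack[-2]))  # arc: top of stack -> word below
                del stack[-2]
            elif action == "RIGHT" and len(stack) > 1:
                arcs.append((stack[-2], stack[-1]))  # arc: word below -> top of stack
                stack.pop()
            else:
                break                                # no valid transition predicted
        return arcs

    # Toy run for "Carlos saiu": shift both words, attach Carlos under saiu, then saiu under ROOT.
    actions = iter(["SHIFT", "SHIFT", "LEFT", "RIGHT"])
    print(parse_arc_standard(["Carlos", "saiu"], lambda s, b: next(actions)))
    # [('saiu', 'Carlos'), ('ROOT', 'saiu')]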

4.2 Open-Domain Relationship Extraction in Literary Texts

Open-Domain Information Extraction is a recent technique able to process and extract numerous types of relations without requiring as many human-annotated resources as the traditional approach. This technique comprises two main branches of systems that rely on different types of features: shallow parsing and dependency parsing. ReVerb belongs to the shallow parsing branch, as it relies heavily on POS tagging and NP-chunking. ReVerb, as described in Section 3.2, also presents limitations that come from the use of these more basic features to perform the extraction of relations. OLLIE, on the other hand, belongs to the dependency parsing branch, as it uses the dependencies between words to perform the extractions.
The following sections explain in more detail how these systems were adapted to accept Portuguese literary texts and to perform relation extraction.

6https://code.google.com/archive/p/word2vec/


(a) ReVerb's original syntactic constraint:

    V | VP | VW*P
    V = adv? verb particle? adv?
    W = (noun | adj | adv | pron | det | verb)
    P = adv? (prep | particle | inf. marker) adv?

(b) ReVerb's syntactic constraint adapted to Portuguese:

    V | VP | VW*P
    V = adv? verb adv?
    W = (noun | adj | adv | pron | det | verb)
    P = adv? prep adv?

Figure 4.1: Comparison between (a) ReVerb's original syntactic constraint and (b) ReVerb's syntactic constraint adapted to Portuguese.

4.2.1 Adapting ReVerb

ReVerb is a follow-up to OIE's first-generation systems, such as TextRunner and WOE. It tries to avoid the major errors of those systems by basing its extraction process on two main rules. These extraction rules are the core of ReVerb, as they aim to fully capture relation phrases while avoiding incoherent and uninformative extractions. While adapting ReVerb, the first step consisted in altering the syntactic rule. This rule is a part-of-speech-based regular expression that defines the sequences of POS tags that relation phrases should comply with. The original rule was therefore changed (Figure 4.1) to use the Universal Tagset, with the help of the available treebank mappings7.
With the syntactic rule adapted to the universal tagset, the second step was to incorporate the POS tagger model, needed before relation phrases can be extracted. Since the Stanford POS tagger framework uses Java, this was not a complicated task and the model was incorporated easily. Both models that accomplish the NP-chunking part were then incorporated. Stanford's model is used more often, since its framework is known to achieve higher accuracy; OpenNLP's model, being much faster, is reserved for heavy-computation tasks only. A sketch of how the adapted tag-sequence constraint can be applied is given below.
The first rule aims to capture relation phrases fully, while the second rule, the lexical rule, tries to avoid overly specific relation phrases by using a dictionary. The dictionary is composed of relation phrases in their stem form, together with the number of distinct arguments each was seen with. Before returning an extraction, ReVerb checks the dictionary to see whether the current relation phrase was seen with at least 20 (the default parameter) distinct second arguments. The dictionary therefore had to be rebuilt as part of the adaptation, as it directly determines the lexical rule of the adapted system. To accomplish this task, a large dataset was needed, since the original dictionary was built from 500 million Web sentences. Besides all previous datasets (CINTIL, Floresta Sintactica, Tycho Brahe, the text supplied with the Universal Dependencies, and the Publico news texts), all Portuguese Wikipedia page texts were downloaded, and together these were processed by ReVerb to obtain a large number of extractions.
Whenever a relation phrase matches the regular expression defined by the syntactic constraint, ReVerb transforms the relation phrase into its stem form, using an open-source Portuguese stemmer8. Afterwards, it searches for this stem form in the dictionary file, checking whether it exists and whether it was seen with at least 20 distinct arguments. ReVerb discards the relation phrase if it does not comply with these constraints.
Finally, the confidence function was re-trained using around 200 sentences taken from Portuguese Wikipedia articles, expressing relations between the subject entity and the attributes from the infoboxes. ReVerb processed this set of sentences and the resulting extractions were manually annotated as either correct or incorrect. The confidence function then re-learns the feature weights from these annotations.
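As an illustration only (not the actual implementation, which operates inside ReVerb's Java code over chunked sentences), the adapted constraint of Figure 4.1(b) can be viewed as a regular expression over the sentence's sequence of Universal POS tags; the mapping of 'prep' to the ADP tag and the toy tag sequence below are assumptions made for this sketch.

    import re

    V = r"(?:ADV\s)?VERB(?:\sADV)?"          # V = adv? verb adv?
    W = r"(?:NOUN|ADJ|ADV|PRON|DET|VERB)"    # W
    P = r"(?:ADV\s)?ADP(?:\sADV)?"           # P = adv? prep adv?
    RELATION = re.compile(rf"{V}(?:\s{P}|(?:\s{W})*\s{P})?")   # V | VP | VW*P

    # POS tags of a toy sentence such as "Carlos foi para a carruagem".
    tags = "NOUN VERB ADP DET NOUN"
    print(RELATION.search(tags).group(0))    # 'VERB ADP', i.e. a V P relation phrase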

7https://github.com/slavpetrov/universal-pos-tags
8https://code.google.com/archive/p/ptstemmer/


Dataset                      Lines        Extractions
CINTIL                       30,383       35,941
Floresta Sintactica          10,138       12,738
Tycho Brahe                  2,252        4,382
Universal Dependencies       12,401       18,487
Publico news text            2,548,406    7,127,375
Wikipedia PT                 3,389,453    7,199,880
Total                        5,993,033    14,398,803

Table 4.1: Datasets and the corresponding number of sentences and extractions supplied to the new dictionary.

4.2.2 Adapting OLLIE

OLLIE is another OIE system that, unlike ReVerb, uses dependency parsing information to extract relation tuples. It also tries to compensate for some of the weaknesses of previous OIE tools, by extracting relations that are not only mediated by verbs and by adding context to non-factual extractions.
OLLIE relies heavily on dependencies and, therefore, the first step in adapting this tool was to incorporate the dependency parsing and POS tagging models. Due to incompatibilities between versions of programming languages and the runtime environment, the tagger and parser had to be incorporated in a Java wrapper. This wrapper uses the tagger and parser to extract the dependencies of the input sentences and feeds OLLIE with this information in the right format. The wrapper then receives the extractions and returns them to the user.
The second step consisted in altering specific code and other resources, namely:

• Modification of the dependency labels and tags stated in the code, in order to reflect and match the most recent sets (the Universal Dependencies and the Universal Part-of-speech Tagset). This was accomplished with the help of the available tag mappings9 and of mappings from older dependency versions/Stanford Dependencies to more recent ones [20];

• Alterations in language-specific code that mostly affected the building of the open pattern templates. OLLIE obtains the open pattern templates by first building the patterns and then the templates. The patterns are a simple representation of the dependency paths between the arguments and the main content word of the relation phrase (usually a verb), while the templates are a simpler representation of the relation phrase only, encoding its words with generic labels. As an example, the following are the template and corresponding pattern of a simple extraction, before and after the adaptation:

Before:
be {rel} {prep}    {arg1} <nsubjpass< {rel:postag=VBN} >{prep:regex=prep (.*)}> {arg2}
After:
ser {rel} {nmod}    {arg1} <nsubjpass< {rel:postag=VERB} >{nmod:regex=nmod (.*)}> {arg2}

The adapted OLLIE encodes the Portuguese auxiliary verbs (i.e. ser, estar, haver, ter) by transforming the relation phrase words into their stem form and comparing them against various word lists. It also encodes the prepositions that usually appear merged with the nmod dependency. This happens due to the new Stanford CCProcessed dependency format, which simplifies the dependency tree by merging mainly preposition arcs into other existing dependencies.

9https://github.com/slavpetrov/universal-pos-tags


The older version of Stanford CCProcessed merges them into the prep dependency, as can be seen in the before example;

• Replacement of the existing English stemmer with an open-source stemmer10 that processes Portuguese words using the ORENGO algorithm. ORENGO is a stemming algorithm developed for effective suffix-stripping of the Portuguese language, and it performs better than the Portuguese version of the Porter algorithm.

OLLIE extracts relation tuples by following a model that contains patterns and templates, expressing not only the composition of the relation phrase but also its association with the dependency path between the arguments and the relation. These are called open pattern templates, and they are saved in a file along with their confidence levels.
After all the previous steps were complete, it was possible to create the model that contains these open pattern templates. The seed tuples came from high-confidence results of the now adapted ReVerb (confidence over 90%) on all available datasets. These tuples were then filtered, preserving only the ones containing entities in both arguments.
In order to obtain a more diverse set of tuples, all content words from the high-confidence set of tuples were extracted and matched against all datasets. Potentially erroneous tuples were eliminated by filtering out those whose dependency path, linking the arguments and the relation words, was longer than 4. Open pattern templates were then generated by merging and processing these two resulting sets.
Finally, the confidence function was re-trained using around 200 sentences taken from Portuguese Wikipedia articles. Each resulting extraction was manually annotated as either correct or incorrect. The annotated extractions were then given to the confidence function, which re-learns the feature weights.

4.3 Summary

This chapter described the implementation details of the two OIE systems in focus, as well as the development and incorporation into these systems of NLP models specific to the Portuguese language:

• Almost all the developed models used Stanford's NLP frameworks. Several datasets were used in the construction of these models, and preprocessing steps were carried out in order to minimize errors when merging them. A new tagset and label set, the Universal Part-of-speech Tagset and the Universal Dependencies, were used to homogenize the datasets;

• Besides the language-specific code that was altered, ReVerb's final version includes an adjusted syntactic rule which conforms to the tagset used by the POS tagger. It incorporates Stanford's POS tagger as well as Stanford's syntactic parser to perform the POS tagging and NP-chunking tasks. An OpenNLP chunker model was also trained and added to ReVerb, used only for heavy-computation tasks such as the construction of the dictionary, as it is faster (although less accurate) than the Stanford models. A new dictionary file was built using all available datasets, including a set of news texts from Publico and all Portuguese texts available from Wikipedia. Finally, the confidence function was re-trained on manually annotated extractions obtained from around 200 Portuguese sentences taken from Wikipedia;

• OLLIE's final version incorporates Stanford's dependency parser and Stanford's POS tagger, along with language-specific code that was transformed and mapped with respect to the Penn Treebank

10https://code.google.com/archive/p/ptstemmer/


tags/Universal Part-of-speech tags, the Stanford Dependencies/Universal Dependencies, and the English/Portuguese language. Open pattern templates were then created using high-confidence tuples obtained by processing all available datasets with the adapted ReVerb. More tuples were obtained by matching the content words from the high-confidence tuples against the sentences of all datasets, in order to get a more extensive set of bootstrapping open pattern templates. All of these tuples were then filtered, preserving only the tuples with entities in both arguments and with a dependency path of length at most 4. Finally, the confidence function was re-trained by supplying OLLIE with manually annotated extractions from 200 Wikipedia sentences.


Chapter 5

Experimental Validation

This chapter details the evaluation of the NLP models and of the two adapted OIE systems. Section 5.1 describes in more detail the preprocessing steps applied to each dataset before starting the training processes. Section 5.2 presents the results obtained after testing each NLP model with various features, ending with the results related to extractions from literary texts.

5.1 Datasets and Methodology

In the development of this thesis, there was a need to build and use NLP models for some of the basic NLP operations required by OIE systems. Each model had its own set of datasets which, after some deliberate fixes, were used to build the final model. The statistics of the datasets involved in the development of the NLP models are shown in Table 5.1. In order to evaluate these datasets combined with each other, a normalization preprocessing step was needed beforehand.

For the POS tagging task, all four datasets were used. The first and most important step was to change the POS tags of each dataset to a homogeneous set. The homogeneous set chosen was the Universal Tagset, which is composed of 12 common part-of-speech tags that exist across several languages. Therefore, following the available mappings1 and the definitions [45] of each universal tag, each dataset was transformed, using regular expressions, so as to contain only this set of tags (a small sketch of such a mapping step is given below).
Further improvements were made to the datasets, related to their annotation structure. All contractions were joined together (e.g. de and a resulted in da), as well as all verbs and their clitics (e.g. encontrar- and -se were joined to produce encontrar-se). Although this may seem to go against the standards followed by current datasets, where these tokens are always kept separate, having them merged avoided the need for automatic normalization steps that could bring more errors into the final model.
The Floresta Sintactica dataset also showed a particularity where some words were joined together with underscores, mostly denoting multi-word expressions and representations of single entities. These were separated in order to preserve uniform word boundaries.
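As a purely illustrative sketch of this mapping step (the corpus-specific tag names below are invented stand-ins rather than the actual CINTIL, Floresta Sintactica or Tycho Brahe inventories; the real transformation followed the published mapping files), the conversion can be pictured as a token-by-token lookup:

    # Hypothetical corpus-specific tags mapped to the 12-tag Universal Tagset.
    TAG_MAP = {
        "V-FIN": "VERB", "V-INF": "VERB",
        "CN": "NOUN", "PNM": "NOUN",
        "ADJ": "ADJ", "ADV": "ADV",
        "PREP": "ADP", "ART": "DET",
    }

    def to_universal(tagged_sentence):
        """tagged_sentence is a list of (word, corpus_tag) pairs."""
        return [(word, TAG_MAP.get(tag, "X")) for word, tag in tagged_sentence]

    print(to_universal([("Carlos", "PNM"), ("chegou", "V-FIN"), ("a", "PREP")]))
    # [('Carlos', 'NOUN'), ('chegou', 'VERB'), ('a', 'ADP')]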

As for the NER task, only the CINTIL corpus was used to create the models. The CINTIL corpus

1https://github.com/slavpetrov/universal-pos-tags


posed few challenges before the models could be trained and evaluated. It already had the same representation as the final POS tagger dataset, with the contractions, verbs and clitics joined together. However, the original encoding was changed from BIO to SBIEO, as this usually leads to better results [55]; a small sketch of this re-encoding step is given below.
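The following is a minimal sketch of that re-encoding, under the usual convention that S marks single-token entities, B/I/E mark the beginning, inside and end of multi-token entities, and O marks tokens outside any entity; the exact conversion applied to CINTIL is not reproduced here.

    def bio_to_sbieo(labels):
        """Convert a sequence of BIO labels (e.g. 'B-PER', 'I-PER', 'O') to SBIEO."""
        converted = []
        for i, label in enumerate(labels):
            if label == "O":
                converted.append(label)
                continue
            prefix, kind = label.split("-", 1)
            next_label = labels[i + 1] if i + 1 < len(labels) else "O"
            ends_here = next_label != "I-" + kind
            if prefix == "B":
                converted.append(("S-" if ends_here else "B-") + kind)
            else:  # prefix == "I"
                converted.append(("E-" if ends_here else "I-") + kind)
        return converted

    print(bio_to_sbieo(["B-PER", "I-PER", "O", "B-LOC"]))
    # ['B-PER', 'E-PER', 'O', 'S-LOC']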

Three datasets possessed manually annotated syntactic tags: CINTIL, Floresta Sintactica and Tycho Brahe. These datasets were normalized in order to be used for the NP-chunking task, for which two models ended up being developed.
OpenNLP required a final training dataset in which words, POS tags and chunks are separated by tabs. The chunks were obtained by capturing the constituent tag closest to each leaf node (i.e. word) by means of pattern matching. The constituent tags were then transformed into generic tags, as each dataset contained a different set of constituent tags. BIO encoding was then added to the training datasets, as required by the framework.
The second model, built with Stanford's Parser framework, required constituent parse trees. These trees also went through a process in which the tags were transformed into generic tags, as the original ones were too specific and complex for the task at hand.
Again, contractions and verb clitics were merged together in order to follow the same annotation details as the previous models.

Finally, for the dependency parsing task, two datasets were used to build the final model. The older dataset, provided by the Universal Dependencies project contributors, had no clear source. The other consisted of the Floresta Sintactica dataset labeled with the most recent version of the Universal Dependencies label set. The older dataset therefore went through some modifications in order to contain the latest version of the Universal Dependencies. This was accomplished using the available mappings [20], which covered most of the dependencies in this dataset. The rest were defined by an automatic process, in which a model trained on the Floresta Sintactica labeled the unmapped ones.
Word boundaries were then redefined, with verb clitics and contractions joined together, following the same decisions as in the previously developed models. Dependencies had to be re-adjusted after this normalization process; a simplified illustration of the CONLL-style representation is shown below.
The Floresta Sintactica dataset again showed words compounded together by means of the underscore. These words represented single entities and, unfortunately, were not separated during the normalization process. This is a difficult step to accomplish by automatic means, and a manual fix was not possible due to the overwhelming number of such instances in the dataset.
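For reference, a CONLL-style dependency training example has one token per line, with columns for (at least) the token index, word form, lemma, POS tag, head index and dependency label. The toy example below only illustrates the layout and label style; it is not a line taken from either training dataset.

    # ID  FORM    LEMMA   POS    HEAD  DEPREL
    1     Carlos  Carlos  NOUN   2     nsubj
    2     chegou  chegar  VERB   0     root
    3     a       a       ADP    4     case
    4     Lisboa  Lisboa  NOUN   2     nmod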

Dataset                        Sentences    Total Words    Unique Words
CINTIL                         30383        681874         35203
Floresta Sintactica            10138        243103         25595
Tycho Brahe                    2252         60275          7589
Universal Dependencies         12401        305815         22723
UD + Floresta                  21353        491458         41758
CINTIL + Tycho + Floresta      55197        1030724        58833
All                            55174        1291068        60608

Table 5.1: Statistical characterization of all training datasets.


5.2 Experimental Results

In this section, the results obtained and their analysis are presented, covering several experimental validations of the developed NLP models and of the two OIE systems adapted for four different books. Most tasks were evaluated in terms of precision, recall and F1-measure. Dependency parsing models were, however, evaluated in terms of UAS and LAS; these metrics are explained alongside the results of that NLP model. Section 5.2.1 reports the results of each developed NLP model, while Section 5.2.2 presents several tests performed with the two OIE systems on four literary books, followed by the conclusions drawn from the obtained results.

5.2.1 POS Tagging, Entity Recognition and Parsing Portuguese Texts

In order to evaluate the NLP models, using local and non-local features, the tests used a 5-fold cross-validation technique. In k-fold cross-validation, the dataset is split into k equal-sized samples. One sample is left out for testing, while the rest is used for training the model. This procedure is repeated so that each of the k samples is used once for testing, with the remaining ones used to train another model. In the end, the average score is computed over the k validation results, giving an estimate of the model's quality.
The common metrics used to evaluate NLP tasks are Precision, Recall and F1-measure. Precision expresses the fraction of correctly labeled instances over the total number of instances returned by the system being evaluated, while Recall indicates the fraction of correctly labeled instances over the total number of gold-answer instances. The F1-measure balances Precision and Recall by computing their harmonic mean. The results obtained for the NLP models with the various features are therefore reported using these three evaluation metrics.
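In symbols, writing TP, FP and FN for the numbers of true positives, false positives and false negatives, these definitions correspond to

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.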

The POS tagger was built and tested using the Stanford POS Tagger framework. Due to the way the tagger is hard-coded, mostly local features could be used during the training process. The following features were used to train and evaluate the tagger:

• Previous and following word features;

• The two tags before the current word being tagged;

• Suffix of length 5 and prefix of length 4;

• Word clusters.

Even though the Stanford POS tagger allows a bidirectional approach, i.e. taking into account the tags that follow the current word being tagged, the unidirectional approach showed better results when applied to the final merged dataset. Tagger test results using these features are shown in Table 5.2. The word clusters feature slightly improves the results, possibly because it is a non-local feature that mostly affects rare words; since rare words are far less frequent than common ones, its impact on the overall results is minimal.

The NER model was built and tested using Stanford's NER framework, with only the CINTIL corpus as the training dataset. Unlike the tagger framework, the NER framework allows much more feature tweaking when training a model, supporting both local and non-local features. The following list presents the features associated with each test:


• The current, previous and next words, within a window of two tokens;

• The tag before the current word being tagged;

• A word shape function that transforms the current word into a simplified form, encoding each character according to its type: if the character is a digit, an upper-case letter, or a lower-case letter, it returns d, X, or x, respectively;

• The stem form of each word being tagged;

• The cluster that the current word belongs to (i.e. group of similar words);

• Gazetteers composed of real and fictional entities from the three typical categories help the NER assignment by providing more features for the model to train on;

• Name lists composed of common female and male names also help the NER model's assignment by allowing more features to be learned;

• A value indicating whether the current word being tagged is usually seen with its first letter upper-cased, taking the value 1 if this is always true and 0 otherwise.

Table 5.3 presents the performance of the model when classifying entity spans and individual tokens. Entity spans are sets of contiguous words that belong to one entity type; these are harder to classify, since the model often has to deal with multi-word entities. Individual tokens, on the other hand, are easier, as boundary issues do not affect them.
The results in Table 5.3 show that the stems feature clearly improved the NER results, whereas the use of word clusters did not. Given that NER results tend to improve with larger training data and a larger number of clusters, we can perhaps conclude that the clusters were too few for this task, or that they contain errors. Unfortunately, it was not possible to further increase the number of clusters, as the training data had a significant size and the clustering would take months to complete. The remaining features show a slight improvement in the results, although their true effect might be hidden by the errors of the word clusters.

Neither OpenNLP nor the Stanford Parser framework allowed much tweaking of the features. Firstly, Stanford's constituency models were evaluated individually on the classification of NP-chunk tags, using 5-fold cross-validation. Through pattern matching, the NP-chunks belonging to the gold answers and to the obtained tree were extracted. NP-chunks were then compared by checking their ranges: if both ranges matched, the obtained NP-chunk was considered correct. This evaluation process was custom-developed for these datasets.
Although CINTIL achieves higher performance (Table 5.4), it also has shorter training sentences compared to the other datasets. Tycho was the chosen dataset due to its structural consistency and the higher number of NP-chunks found in a closer evaluation performed on a sample of sentences.
OpenNLP's chunking model was evaluated using 5-fold cross-validation. The training dataset was composed of three datasets in total. Results showed 91.11%, 92.03% and 91.57% in precision, recall and F1-measure, respectively.

In order to evaluate the dependency parsing models, different metrics were required. The performance of this model was measured using UAS and LAS. These metrics present the accuracy of the model: (1) UAS is the unlabeled attachment score, where the obtained arcs are compared to the gold-answer arcs without considering the arc labels; (2) LAS is a similar measure, but also takes into account the arc


                           Overall Tag Performance           Accuracy             Accuracy
                           A       P       R       F1        Ambiguous Words      Unknown Words
Basic (Bidirectional)      93.20   87.23   87.87   87.31     83.66                74.74
Basic (Unidirectional)     94.40   91.83   88.54   89.98     92.94                93.84
+ Word Clusters            94.39   91.83   88.63   90.05     93.70                93.80

Table 5.2: Tagger results from testing with the 4 datasets merged, using 5-fold cross-validation.

Entity Spans (PER, LOC, ALL)
                     PER                       LOC                       ALL
                     P      R      F1          P      R      F1          A      P      R      F1
Basic                93.64  87.22  90.31       91.91  84.71  88.16       n/a    92.62  85.20  88.76
+ Stems              99.12  98.14  98.63       97.37  96.85  97.11       n/a    98.22  96.92  97.56
+ Word Clusters      97.09  93.26  95.14       92.29  89.15  90.69       n/a    94.79  89.99  92.33
+ Gazetteers         97.29  93.75  95.49       91.89  90.75  91.31       n/a    94.76  90.60  92.63
+ Upper-cased        97.09  94.10  95.57       91.90  90.86  91.37       n/a    94.81  90.88  92.80

Entity Spans (ORG, MSC) and Individual Tokens
                     ORG                       MSC                       Individual Tokens
                     P      R      F1          P      R      F1          A      P      R      F1
Basic                91.89  85.86  88.77       91.95  77.69  84.21       98.12  92.61  87.73  90.04
+ Stems              97.65  96.16  96.90       97.90  94.51  96.18       99.01  96.10  93.76  94.88
+ Word Clusters      93.35  88.61  90.91       94.56  83.42  88.63       98.09  92.73  87.13  89.76
+ Gazetteers         93.62  89.15  91.32       93.77  82.75  87.91       98.22  92.62  87.80  90.07
+ Upper-cased        93.71  88.81  91.20       94.64  84.38  89.22       98.26  92.84  88.15  90.37

Table 5.3: NER results obtained from testing with the CINTIL corpus using 5-fold cross-validation.

labels. In other words, UAS is computed by calculating the accuracy of the assigned arcs without the grammatical relations attached (Equation 5.1), while LAS is calculated by considering both the arc and the grammatical relation (Equation 5.2).

\mathrm{UAS} = \frac{\#\,\text{correct arcs, discarding labels}}{\#\,\text{all arcs}}    (5.1)

\mathrm{LAS} = \frac{\#\,\text{correct arcs, considering labels}}{\#\,\text{all arcs}}    (5.2)

The dependency parsing model was built and tested using the Stanford Parser framework, which provides two approaches to obtain a dependency model. The selected approach uses dependency-labeled training examples.
Two datasets were used to train the model. Both were selected after evaluating them on a small sample of test sentences belonging to the OLLIE tests; the idea was to check that the dependencies did not diverge much from the expected ones, since OLLIE relies heavily on dependency parsing. Results showed a higher number of correct dependencies for the model trained on the two datasets combined, hence the decision to keep both in the final model.
Unfortunately, it was not possible to test the model with several features, as the only tuning parameter available was the dimensionality of the word embeddings. Tests with high-dimensional vectors produced poor results, reaching only 12% UAS. The best results were achieved using 50-dimensional word vectors, reaching 79.98% UAS and 75.91% LAS. Results were obtained by testing on the two datasets using a 5-fold cross-validation technique (Table 5.5).

5.2.2 Open-Domain Relationship Extraction in Literary Texts

ReVerb and OLLIE were the two systems adapted in the course of this thesis, resulting in these


Datasets                  P        R        F1
CINTIL                    72.29    75.08    72.98
Floresta Sintactica       68.22    67.11    67.31
Tycho Brahe               66.05    66.48    65.48

Table 5.4: Constituent parsing results for each dataset, using 5-fold cross-validation.

Datasets                    UAS      LAS      Out-of-vocabulary Words (% of testing corpus)
Floresta Sintactica         84.21    80.23    10.75
Universal Dependencies      82.44    78.78    8.01
Floresta + UD               79.98    75.91    7.95

Table 5.5: Dependency parser results with varying datasets, using 5-fold cross-validation.

two systems being able to process Portuguese text and extract relation tuples, following the recent Open Information Extraction paradigm. Both systems were tested on a set of sentences from four different books in the public domain:

• Os Maias, written by Eça de Queirós;

• Orgulho e Preconceito, written by Jane Austen, translated by Lúcio Cardoso;

• Amor de Perdição, written by Camilo Castelo Branco;

• Alice no País das Maravilhas, written by Lewis Carroll, translated by Isabel De Lorenzo.

These books were selected from a public domain list because they have different writing styles and are narrated in the third person, thus providing greater diversity and allowing a more extensive evaluation. Books translated from English were also considered, since they could reveal stylistic cues and differences with respect to original Portuguese literary texts. Nonetheless, all the considered texts are in Portuguese, as it is the language in focus in this dissertation.

The metrics used to evaluate the newly adapted OIE systems are the same as the ones used to evaluate the NLP models. However, due to the lack of labeled data, obtaining exact Precision and Recall values over a large number of sentences is difficult in this situation, since no gold-answer relations exist for Portuguese literary texts. Therefore, a smaller set of sentences was selected and manually annotated in order to perform the evaluation. The detailed results are reported over this smaller set of sentences using, again, (1) Precision, the percentage of correct extraction instances over the total number of extraction instances returned; (2) Recall, the fraction of correct extraction instances over the total number of gold-answer extractions (here an estimate based on the manual annotation); and finally, (3) F1-measure, the harmonic mean of Precision and Recall.

Most related work focusing on Relation Extraction in literary texts was more concerned with capturing the plot and the characters' interactions. Therefore, the tests focus on assessing only the extractions that capture a relation between two entities. The relation can be anything, as the type of interaction is not the subject in focus in this dissertation. Firstly, all books go through some necessary steps, in which sentences are filtered so as to preserve only those containing 2 or more entities. Co-reference is not used during these tests, so the entities have to appear explicitly in the text and in the obtained relation tuples. Between OLLIE and ReVerb, the latter imposes more restrictions when performing the extractions. The following is a brief reminder of ReVerb's limitations when extracting relation tuples:


Literary Book                      Contiguous Relations    Non-contiguous Relations    Implicit Relations
Amor de Perdição                   18                      20                          2
Alice no País das Maravilhas       10                      11                          0
Orgulho e Preconceito              17                      16                          2
Os Maias                           24                      21                          2
Total                              69                      68                          6

Table 5.6: Statistical characterization of the relations found in 92 sentences of four different literary books.

1. Cannot extract relation tuples when dealing with non-contiguous phrase structures;

2. Relation phrase must be located between the arguments.

Given these limitations, a set of 40 sentences was selected for the tests. Of these 40 sentences, half are in agreement with ReVerb's extraction constraints and the other half were randomly selected. The following step consisted in manually extracting and annotating relation tuples of the form (entity1, relation, entity2) from these 40 sentences. Several types of relations were found during this process:

• Relations compatible with ReVerb's limitations, where the relation phrase appears explicitly between two entities and these are contiguous;
Example: Quando minha sobrinha Georgiana foi para Ramsgate no Verão passado, fiz questão de que dois criados homens a acompanhassem.

• Relations where the phrase structures are non-contiguous, often having verbs in later positions of the sentence referring to distant arguments;
Example: Depois de jantar Carlos percorreu o Fígaro, folheou um volume de Byron, bateu carambolas solitárias no bilhar, assobiou malaguenas no terraço - e terminou por sair, sem destino, para os lados do Aterro.

• Implicit relations. These mostly arise from special phrase structures that appear between commas, and often specify details that are interesting to capture. This type of relation is not mediated by a verb, unlike the ones mentioned previously.
Example: A este tempo, Manuel Botelho, cadete em Bragança, destacado no Porto, licenciou-se para estudar na Universidade as matemáticas.

More statistics about the types of relations found can be seen in Table 5.6. As shown there, contiguous and non-contiguous relations have a similar distribution, which is ideal for the tests that follow.

Results from the first test are shown in Table 5.7, using ReVerb in its default form (syntactic constraint and lexical constraint enabled) on the set of sentences. The results cover only the obtained extractions whose confidence level was higher than 50%.
Table 5.7 presents various columns detailing the obtained results. As can be seen, the first test did not reveal good results. After examining the failed extractions, it was clear that most were incomplete, as they failed to capture the second entity. This is due to the dictionary filtering out long and overly specific relations, which leads to the capture of the wrong NP-chunk. Reviewing ReVerb's dictionary, where the most frequent relation phrases appear at the top, it was clear that it contained mostly short relation phrases.
Since the manually extracted relations from each book revealed longer relation phrases, which were


not covered in the dictionary, a second test was performed, in which ReVerb's lexical constraint was disabled. Disabling the lexical constraint could worsen the results; however, Portuguese literary texts show frequent use of punctuation and prepositions. Prepositions break off the match of the syntactic constraint, as can be seen in the POS tag-based rule in Chapter 4, Figure 4.1, while punctuation breaks the match of both relation phrases and NP-chunks. The dictionary would be more useful in situations where the text contains short and simple relations, which does not seem to be the case in literary texts. The following example shows a manually extracted relation, presenting a longer relation phrase due to a figure of speech, and its simpler equivalent.

Carlos fez voar o coupé até à rua de S. Francisco
↓
Carlos foi à rua de S. Francisco

Hence, the second test was performed with the lexical constraint disabled, again considering only the extractions whose confidence level was above 50%.
The second test, shown in Table 5.8, presents better results. As expected, disabling the dictionary allowed ReVerb to capture more relation tuples (provided they occur in contiguous phrase structures). The frequent punctuation also plays a big part, as it helps define the boundaries of the obtained extractions. Other extractions still fail due to NP-chunking errors and to relation phrases not matching the syntactic constraint, besides the relations that simply do not present suitable conditions for ReVerb.
NLP task errors were expected; however, errors related to the syntactic constraint deserve more attention. After matching all the manually extracted relations against the syntactic constraint, some failed due to the prepositions in the relation phrases. There were numerous cases where these were at fault, mainly for the following reasons:

• As mentioned in Section 5.1, contractions were merged as a step of the dataset normalization process. Contractions are mostly composed of a preposition and a determiner, and a determiner tag ends up being assigned to them after the merge. Having these merged, and tagged as determiners, damaged ReVerb's performance, as a number of long relation phrases did not have the required preposition to end them; determiners were found instead of the expected prepositions;

• Other relation tuples contain more than one preposition, which makes the relation phrase break off earlier than it should. For example, the sentence Manuel Botelho mudou de regimento para Lisboa contains two prepositions, resulting in the following incorrect extraction: (Manuel Botelho; mudou de; regimento).

Although changing the syntactic constraint is possible and easily accomplished, the number of sentences in which two prepositions appeared in a relation phrase was significantly lower compared to the other issue found. One can also question whether the change would be beneficial, since it could produce even longer relation phrase compositions than the already long ones found during the manual annotation. Finding two prepositions in a relation phrase can also be evidence that ReVerb cannot fully capture this kind of relation, which employs a less direct writing style, for example:

Elizabeth foi levada até a carruagem por Mr. Collins.
↓
Obtained: (Elizabeth; foi levada até; a carruagem)
Expected: (Elizabeth; foi levada até a carruagem por; Mr. Collins)


Literary Book                      Sentences    Correct        Output         Total        P        R        F1
                                                Extractions    Extractions    Relations
Amor de Perdição                   24           3              19             40           15.79    7.5      10.17
Alice no País das Maravilhas       17           1              7              21           14.28    4.76     7.14
Orgulho e Preconceito              24           3              15             38           20.0     7.89     11.32
Os Maias                           27           6              20             47           30.0     12.77    17.91

Table 5.7: ReVerb results on four different literary books (P, R and F1 refer to the overall performance).

Nonetheless, the major problem consisted in wrong normalization decisions.
Finally, some correct relation tuples were not captured because of their low confidence level. Most of these extractions were located in later positions of the sentence, and therefore had a higher risk of containing the wrong arguments. It would most likely have been beneficial to train the confidence function with more sentences, as the confidence levels of the obtained extractions were spread over wide intervals. A better trained confidence function also assigns more accurate confidence scores to the extractions, allowing higher precision, as correct high-confidence extractions would not be filtered out.
Some observations can be drawn from the obtained results. Literary texts present long relation phrases separating entities. Since ReVerb can only extract these under limited conditions, disabling the dictionary can improve the results.

Moving on to the OLLIE tests, the testing procedure is similar to ReVerb's: the same set of sentences is given to OLLIE and extractions are returned. Table 5.9 shows the results obtained. These results are not encouraging. One aspect that certainly contributed to this poor performance was the lower accuracy of the dependency parsing model. OLLIE relies heavily on dependency parsing to perform the extractions: it builds open pattern templates, which represent the dependency links between the arguments and the relation phrase by means of a simpler dependency-labeled structure. Extraction is then performed by transforming the input sentences into their dependencies, followed by matching the patterns against the obtained dependencies. Errors can propagate, since one malformed pattern can induce several wrong extractions; likewise, badly obtained dependencies match the wrong pattern, or none at all.
After further inspection of the obtained results, it was noticeable that a high number of extractions failed due to details of the dataset annotations. Often, wrong extractions such as (que; estudava medicina em; Coimbra) followed the correct pattern, but the subject of the verb referred to words that were not valid entities. Both datasets confirmed the existence of this particularity, where


Literary Book                      Sentences    Correct        Output         Total        P        R        F1
                                                Extractions    Extractions    Relations
Amor de Perdição                   24           12             22             40           54.54    30.0     38.71
Alice no País das Maravilhas       17           5              10             21           50.0     23.81    32.26
Orgulho e Preconceito              24           11             22             38           50.0     28.95    36.67
Os Maias                           27           14             28             47           50.0     29.79    37.34

Table 5.8: ReVerb results on four different literary books, without the lexical constraint (P, R and F1 refer to the overall performance).

words such as que and onde were marked as subjects of verbs located in later positions of the sentence. This particularity persisted from the original datasets, as the normalization process did not affect any dependency labels related to the subject relation.
Besides the high number of wrong extractions, it was also noticeable that the results had somewhat low recall. As mentioned, OLLIE returns extractions according to a set of open pattern templates. These patterns have an associated confidence, which defines the probability of seeing that type of extraction in a sentence. According to the new bootstrapping model developed, extractions would essentially belong to at most two extraction patterns; the remaining patterns had a confidence lower than 9%.

{rel} {arg1} <nsubj< {rel:postag=VERB} >dobj> {arg2} 1.0000

{rel} {nmod} {arg1} <nsubj< {rel:postag=VERB} >{nmod:regex=nmod (.*)}> {arg2} 0.6170

This implies that OLLIE can effectively only extract relation phrases that follow these two open pattern templates. One way to avoid this would be to supply the bootstrapping model with a more diverse set of high-confidence tuples.
Unfortunately, no more tests were performed, as OLLIE had core issues related to the accuracy of the dependency parser and to a particularity shared by both datasets: words such as que were often seen as pronouns and marked as the subject of the sentence, leading to erroneous extractions.

Comparing the results of both tools, it cannot be concluded that they are yet suitable for extracting relation tuples from literary texts. ReVerb's tests showed that this tool was only able to capture relations under suitable conditions, leaving the rest uncaught. OLLIE was able to capture more types of relations, but errors, mostly related to parsing, led to a worse performance than ReVerb's.


Literary Book                      Sentences    Correct        Output         Total        P        R        F1
                                                Extractions    Extractions    Relations
Amor de Perdição                   24           4              21             40           19.05    10.0     13.12
Alice no País das Maravilhas       17           1              26             21           3.85     4.76     4.26
Orgulho e Preconceito              24           9              39             38           23.08    30.0     26.09
Os Maias                           27           8              41             47           19.51    17.02    18.18

Table 5.9: OLLIE results on four different literary books (P, R and F1 refer to the overall performance).

5.3 Summary

This chapter described the test results of each developed NLP task model, evaluating their performance with several local and non-local features. Further tests were performed involving the two adapted systems, presenting the results of evaluating their performance on a set of sentences containing possible extractions between entities. The following was concluded from the latter:

• Literary texts frequently present long relation phrases in sentences featuring more than one entity. These are not captured by ReVerb in its default settings, i.e. with both the syntactic and the lexical constraint enabled. OLLIE can capture them, since the dependencies link the main relation content words to the arguments of a relation;

• The decision to merge contraction words during the normalization process affected ReVerb's performance, since these words' final tag was set to the determiner tag. ReVerb's syntactic constraint generally has three branches of regular expressions covering most relation phrases, and both the medium and the long branch require a preposition to close the relation phrase. Whenever one is not found, only the short relation phrase is matched, leaving the output relation tuple incomplete;

• OLLIE's potential to capture most types of relations present in literary texts was overshadowed by the dependency parser's errors. Also, both training datasets presented a particularity where words such as que were frequently tagged as pronouns and assigned as the subject of the sentence, leading to many wrong extractions;

• Finally, by comparing both systems, it can be concluded that ReVerb is not yet suited for extracting relation tuples from literary texts, as it only captures contiguous ones. When the conditions are


met, ReVerb does an average job of finding them. OLLIE, however, had its performance overshadowed by dependency errors, leaving its results somewhat inconclusive, as further tests could not be performed. OLLIE does have the potential to surpass ReVerb, but it relies heavily on the parser results.


Chapter 6

Conclusions and Future Work

This document presented every detail related to the development of this MSc thesis. In the end, several NLP models were built, capable of processing Portuguese texts, using datasets that are in agreement with the Universal Part-of-speech Tagset and the Universal Dependencies guidelines. The adaptation of two relation extraction systems under the new OIE paradigm brought an interesting contribution, as there is not much work concerning the use of Open Information Extraction on Portuguese texts, especially on Portuguese literary texts.

Across this document, concepts and related works in the field of relation extraction were revisited, followed by the work accomplished in order to obtain the tools needed for evaluation. Finally, the results of the extensive evaluation carried out during the development of this thesis were presented.

Unfortunately, it was not possible to conclude that the adapted systems are fit for relation extraction on literary texts. ReVerb reached a better performance, although it could only extract relation tuples under specific conditions. OLLIE has a much higher potential of accomplishing the task at hand, but errors on the dependency parsing task worsened its results compared to ReVerb.

6.1 Main Contributions

The contributions of this MSc thesis can be summarized as follows:

• I created and evaluated POS tagging, NER and dependency parsing models for the Portuguese language, using Stanford's software framework for training and testing these models. The datasets involved in each task model's training process, mentioned in Section 1.1.2, went through a process of normalization following the Universal Part-of-speech Tagset and the Universal Dependencies guidelines. More details on the normalization process are presented in Chapter 4, while details about the evaluation are given in Chapter 5. These models were then tested using a 5-fold cross-validation technique, where the obtained F1-measures are 90.05% and 90.37% for the POS tagger and the NER model, respectively. The dependency parser model reached 79.98% UAS and 75.91% LAS (a short sketch of how these two parsing scores are computed is given after this list);

• I adapted ReVerb, a tool for performing relation extraction following the OIE paradigm, to extract relation phrases and their arguments from Portuguese texts. The main differences from the original tool, besides implementation details in code specific to English and to the Penn Treebank tags, reside in the creation of the new ReVerb dictionary and the re-training of the confidence function. ReVerb's dictionary was built using the datasets in Section 1.1.2 plus all of Wikipedia's content pages in Portuguese and a set of news from Publico. The confidence function was trained by manually annotating the extractions obtained from around 200 sentences taken from Wikipedia articles. ReVerb reached between 32.26% and 38.71% of F1-measure, after evaluating it on a set of sentences from four different books;

• I adapted OLLIE, a relation extraction tool following the OIE paradigm that aims to overcome ReVerb's limitations. OLLIE was modified in order to extract relation phrases and their arguments from Portuguese texts. Besides language-specific code related to English, the Penn Treebank tags and the Stanford Dependencies, modifications happened mostly on the building and reading of the Open Pattern Templates, which are essential to extract relation tuples. A bootstrapping model was built using high-confidence tuples obtained from the adapted ReVerb by processing all available datasets (mentioned in Section 1.1.2, plus Wikipedia Portuguese articles and a set of news texts from Publico). The confidence function was trained by manually annotating the obtained extractions, again from around 200 sentences taken from Wikipedia articles. OLLIE reached between 4.26% and 26.09% of F1-measure, after evaluating it on a set of sentences from four different books.
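As a complement to the first contribution above, the snippet below is a minimal sketch of how the unlabeled (UAS) and labeled (LAS) attachment scores are typically computed for a dependency parser: the percentage of tokens that receive the correct head, and the percentage that receive both the correct head and the correct dependency label. The data layout and the function name are illustrative, not taken from the evaluation code itself.

    # Minimal sketch (illustrative names): unlabeled and labeled attachment scores.
    # Each token is represented by the index of its head and its dependency label.

    def attachment_scores(gold, predicted):
        """gold, predicted: lists of (head_index, dependency_label) per token."""
        assert len(gold) == len(predicted)
        total = len(gold)
        uas_hits = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
        las_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
        return 100.0 * uas_hits / total, 100.0 * las_hits / total

    # Toy example: 4 tokens, 3 correct heads, 2 of those also with the correct label.
    gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
    pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
    print(attachment_scores(gold, pred))  # (75.0, 50.0)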

6.2 Future Work

Despite the results obtained, there are new and interesting approaches to try out in future work in the area of relation extraction for Portuguese texts.

One interesting approach would be the use of Semantic Role Labeling to extract relation tuples [14, 15]. Although PropBank and FrameNet resources have had a significant increase in amount, Portuguese resources still exist in a lower amount compared to resources dedicated to the English language. Nonetheless, this approach could be very useful for extracting relation tuples from literary texts, as it is a very powerful tool.

Another interesting approach would be to use translation in order to take full advantage of existing tools. Since the English language has a wider variety of tagged resources and linguistic tools for processing English texts, it would be interesting to use translation tools in order to obtain the desired output by applying the available English linguistic tools to non-English texts. Converting non-English texts into English, followed by the information extraction procedure and, finally, projecting the output back to the source language, has been tested before with promising results [28]; a minimal sketch of such a pipeline is given at the end of this section. This approach allows a more complete experimental validation in the field, as it opens doors to the various currently available relation extraction systems that focus on one particular language only.

Other resources are worth mentioning as potentially helpful in similar developments: (1) Polyglot-NER¹ not only contains named-entity annotations, but also provides a system that can build named-entity annotators for 40 major languages using Wikipedia and Freebase [4]; (2) the Colonia Corpus of Historical Portuguese² contains several POS-tagged historical texts from the 16th to the 20th century; (3) the Corpus Informatizado de Textos Portugueses Medievais³ is a corpus containing several texts ranging from the 12th to the 16th century; and finally, (4) Mac-Morpho⁴ is a corpus containing POS-tagged Brazilian Portuguese texts.

Finally, if I were to continue the work presented in this thesis, ReVerb could definitely undergo some further modifications in order to improve its performance. The Universal Part-of-speech Tagset does not provide separate tags for common nouns and proper nouns. However, having the NER model perform entity extraction instead of NP-chunking could definitely reduce errors when finding arguments.
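The sketch below makes the translation-based pipeline mentioned above concrete. Every helper it receives (translate, run_english_oie, project_span) is a hypothetical placeholder for the corresponding step of a cross-lingual projection approach in the spirit of [28]; it is an outline of the data flow, not an existing API.

    # Illustrative outline only: translate, run_english_oie and project_span are
    # hypothetical placeholders for the steps of a cross-lingual projection pipeline.

    def extract_via_projection(pt_sentence, translate, run_english_oie, project_span):
        # 1. Translate the Portuguese sentence into English, keeping word alignments.
        en_sentence, alignment = translate(pt_sentence)

        # 2. Run an English OIE system (e.g., ReVerb or OLLIE) on the translation.
        en_tuples = run_english_oie(en_sentence)

        # 3. Project each (arg1, rel, arg2) span back onto the source sentence
        #    through the alignment, discarding tuples that do not project cleanly.
        pt_tuples = []
        for arg1, rel, arg2 in en_tuples:
            projected = tuple(project_span(span, alignment, pt_sentence)
                              for span in (arg1, rel, arg2))
            if all(projected):
                pt_tuples.append(projected)
        return pt_tuples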

¹ https://sites.google.com/site/rmyeid/projects/polylgot-ner
² http://corporavm.uni-koeln.de/colonia/inventory.html
³ http://cipm.fcsh.unl.pt/gencontent.jsp?id=4
⁴ http://nilc.icmc.usp.br/macmorpho/


Bibliography

[1] Steven Abney. Partial parsing via finite-state cascades. Natural Language Engineering, 2(04), 1996.

[2] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the ACM Conference on Digital Libraries, pages 85–94, 2000.

[3] Alan Agresti. Building and applying logistic regression models. Categorical Data Analysis, Second Edition, 2002.

[4] Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. Polyglot-NER: Massive multilingual named entity recognition. arXiv preprint arXiv:1410.3791, 2014.

[5] Lalit R. Bahl and Robert L. Mercer. Part of speech assignment by a statistical decision algorithm. In Proceedings of the IEEE International Symposium on Information Theory, pages 88–89, 1976.

[6] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction for the web. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2670–2676, 2007.

[7] Michele Banko, Oren Etzioni, and Turing Center. The tradeoffs between open and traditional relation extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 28–36, 2008.

[8] David Soares Batista, David Forte, Rui Silva, Bruno Martins, and Mario Silva. Extraccao de relacoes semanticas de textos em Portugues explorando a DBpedia e a Wikipedia. Linguamatica, 2013.

[9] Thorsten Brants. TnT: A statistical part-of-speech tagger. In Proceedings of the Conference on Applied Natural Language Processing, pages 224–231, 2000.

[10] Sergey Brin. Extracting patterns and relations from the World Wide Web. In The World Wide Web and Databases (WebDB Workshop at the 6th International Conference on Extending Database Technology), pages 172–183, 1998.

[11] Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings of the Conference on Compression and Complexity of Sequences, pages 21–29, 1997.

[12] Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 1992.

[13] Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.

[14] Janara Christensen, Stephen Soderland, Oren Etzioni, et al. Semantic role labeling for open information extraction. In Proceedings of the NAACL HLT International Workshop on Formalisms and Methodology for Learning by Reading, pages 52–60, 2010.


[15] Janara Christensen, Stephen Soderland, Oren Etzioni, et al. An analysis of open information extraction based on semantic role labeling. In Proceedings of the International Conference on Knowledge Capture, pages 113–120, 2011.

[16] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.

[17] Aron Culotta and Jeffrey Sorensen. Dependency tree kernels for relation extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, page 423, 2004.

[18] James R. Curran, Tara Murphy, and Bernhard Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, pages 172–180, 2007.

[19] Peter T. Davis, David K. Elson, and Judith L. Klavans. Methods for precise named entity matching in digital collections. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pages 125–127, 2003.

[20] Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. Universal Stanford dependencies: A cross-linguistic typology. In LREC, pages 4585–4592, 2014.

[21] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pages 449–454, 2006.

[22] Leon Derczynski, Sean Chester, and Kenneth S. Bøgh. Tune your Brown clustering, please. RANLP, 2015.

[23] Jason M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the Conference on Computational Linguistics, pages 340–345, 1996.

[24] David K. Elson, Nicholas Dames, and Kathleen R. McKeown. Extracting social networks from literary fiction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 138–147, 2010.

[25] David K. Elson and Kathleen McKeown. Automatic attribution of quoted speech in literary narrative. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1013–1019, 2010.

[26] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 2005.

[27] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545, 2011.

[28] Manaal Faruqui and Shankar Kumar. Multilingual open relation extraction using cross-lingual projection. NAACL, 2015.

[29] Charlotte Galves and Pablo Faria. Tycho Brahe Parsed Corpus of Historical Portuguese, 2010.

[30] Pablo Gamallo, Marcos Garcia, and Santiago Fernandez-Lanza. Dependency-based open information extraction. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 10–18, 2012.

[31] Barbara B. Greene and Gerald M. Rubin. Automated grammatical tagging of English, 1971.


[32] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 2009.

[33] Zellig Sabbettai Harris. String analysis of sentence structure. Mouton, 1962.

[34] Hua He. Automatic speaker identification in novels. PhD thesis, University of Alberta, 2011.

[35] Hua He, Denilson Barbosa, and Grzegorz Kondrak. Identification of speakers in novels. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1312–1320, 2013.

[36] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Conference on Computational Linguistics, pages 539–545, 1992.

[37] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. Learning 5000 relational extractors. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 286–295, 2010.

[38] Dan Jurafsky and James H. Martin. Speech & Language Processing. Pearson Education, 2000.

[39] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 423–430, 2003.

[40] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282–289, 2001.

[41] John Lee and Chak Yan Yeung. Extracting networks of people and places from literary texts. In Proceedings of the Pacific Asia Conference on Language, Information and Computation, page 209, 2012.

[42] Aibek Makazhanov, Denilson Barbosa, and Grzegorz Kondrak. Extracting family relationship networks from novels. CoRR, 2014.

[43] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning, pages 591–598, 2000.

[44] Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, 2005.

[45] Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, et al. Universal dependency annotation for multilingual parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 92–97, 2013.

[46] Tara McIntosh and James R. Curran. Reducing semantic drift with bagging and distributional similarity. In Proceedings of the Joint Conference of the Annual Meeting of the ACL and the International Joint Conference on Natural Language Processing of the AFNLP, pages 396–404, 2009.

[47] Raymond J. Mooney and Razvan C. Bunescu. Subsequence kernels for relation extraction. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 171–178, 2005.

[48] Franco Moretti. Network theory, plot analysis. New Left Review, 2011.


[49] Ion Muslea et al. Extraction patterns for information extraction tasks: A survey. In Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Machine Learning for Information Extraction, pages 1–6, 1999.

[50] Truc-Vien T. Nguyen and Alessandro Moschitti. End-to-end relation extraction using distant supervision from external semantic repositories. In Proceedings of the Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 277–282, 2011.

[51] Joakim Nivre, Johan Hall, and Jens Nilsson. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the International Conference on Language Resources and Evaluation, pages 2216–2219, 2006.

[52] Patrick Pantel and Marco Pennacchiotti. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics, pages 113–120, 2006.

[53] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. LREC, 2011.

[54] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2), 2008.

[55] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155, 2009.

[56] Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, et al. Open language learning for information extraction. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534, 2012.

[57] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 173–180, 2003.

[58] Fei Wu and Daniel S. Weld. Open information extraction using Wikipedia. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 118–127, 2010.

[59] Alexander Yates and Oren Etzioni. Unsupervised resolution of objects and relations on the web. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 121–130, 2007.

[60] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106, 2003.
