Open Information Extraction from Dialogue Transcriptions
José Jorge Marcos Raposo
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Doctor David Manuel Martins de Matos and Doctor Bruno Emanuel da Graça Martins
Examination Committee
Chairperson: Doctor José Carlos Alves Pereira Monteiro
Supervisor: Doctor David Manuel Martins de Matos
Member of the Committee: Doctor Pável Pereira Calado
October 2016
Acknowledgments
To my supervisors, thank you for the support, the guidance, and the help in every way possible
along this entire journey.
To my parents, and grandparents, for their unconditional love, understanding, support, and
patience, without whom I would not be able to be here today.
To my brother Diogo, for the motivation, the help, the advice, the last minute solutions, and
for showing me the way.
To Pedro, for the amazing support and motivation, even from far away.
To Carolina, for the caring and patience, and for helping me grow in every moment of this
adventure.
To Fredy, Simões, Rita, João and Luís, for never giving up on me and pushing me to the
end.
To Rafa, Diogo, Rute, Ed and Francisco, for the tremendous help at the worst time.
To Sara and Mafalda, for all the companionship and encouragement.
To Catarina, Miguel, Lara, Catarina, Mário, Carlos, Ana, Berta, Christelle, Tiago, Filipe,
Anouk, Sofia, Catarina, Adelino, Ana, António, Lipinho, John, Vanessa, Igor, Afonso, Cristiana,
Iryna and Isabel, for that motivation at the right moment that made all the difference.
Thank You!
Resumo
A Extração Aberta de Informação consiste na tarefa de encontrar uma representação estruturada para as relações e declarações presentes em texto de língua natural, sem usar uma categorização predefinida dos tipos de relações que irão ser extraídos. Ferramentas de extração aberta de informação exploram informação lexical, juntamente com informação sintática e/ou semântica de frases, para nelas procurar relações. Aplicar métodos de extração de informação a transcrições de diálogo, um tipo específico de texto que é menos estruturado e consistente que texto formal, resulta numa redução de desempenho. Para resolver este problema, em primeiro lugar, vamos apresentar um estudo comparativo entre quatro sistemas de extração aberta de informação, nomeadamente os sistemas ReVerb, OLLIE, Stanford OIE e OpenIE 4, com o objetivo de analisar e estudar essa diferença de resultados. Em segundo lugar, iremos implementar uma aplicação que, juntamente com as ferramentas de extração mencionadas, resulta num aumento da precisão em 12 pontos percentuais (pps), da sensibilidade em 11 pps, e do F1-score em 11 pps, para os melhores resultados obtidos, o que melhorou a qualidade das extrações. O nosso sistema pré-processa texto de diálogo antes de este ser passado às ferramentas de extração, simplificando e dividindo frases usando várias técnicas de processamento de língua natural.
Palavras-chave: Extração Aberta de Informação, Stanford OpenIE, OpenIE 4, OLLIE, ReVerb, Diálogo em Língua Natural.
Abstract
Open Information Extraction (Open IE) is the task of finding a structured representation for the relations and assertions present in natural language text, without using a predefined categorization of the relation types that are to be extracted. Open IE tools exploit word tokens, together with syntactic and/or semantic information from sentences, to search for relations in them. Applying Open IE methods to dialogue transcriptions, a specific kind of text that is less structured and consistent than formal text, results in a significant decrease in performance. To address this issue, we first present a comparative study between four Open IE systems, namely ReVerb, OLLIE, Stanford OIE, and OpenIE 4, intended to analyze and justify the difference in results. After these initial tests, we implemented an application that, used together with the aforementioned Open IE tools, increases precision by 11 percentage points (pps), recall by up to 12 pps, and F1-score by 11 pps for the best results obtained, improving the quality of the extractions. Our system pre-processes dialogue text before it is passed to the extraction tools, simplifying and dividing the sentences using several natural language processing techniques.
Keywords: Open Information Extraction, Stanford OpenIE, OpenIE 4, OLLIE, ReVerb, Dialogue in Natural Language.
Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
Glossary
1 Introduction
  1.1 Topic Overview
  1.2 Objectives and Contributions
  1.3 Thesis Outline
2 Background
  2.1 Open Information Extraction
  2.2 Natural Language Processing
    2.2.1 Tokenisation
    2.2.2 Part-of-speech Tagging
    2.2.3 Noun Phrase Chunking
    2.2.4 Dependency Parsing
    2.2.5 Semantic Role Labeling
  2.3 Stanford CoreNLP
  2.4 Open Information Extraction Tools
    2.4.1 ReVerb
    2.4.2 OLLIE
    2.4.3 OpenIE 4
    2.4.4 Stanford Open Information Extraction
3 Introductory Analysis
  3.1 Dataset Study
  3.2 Evaluation Method
  3.3 Open IE: A Comparative Study
    3.3.1 Reported Results
    3.3.2 Benchmarks
  3.4 Dataset Problems
  3.5 Strategy Definition
4 Implementation
  4.1 Overview
  4.2 Input Requisites and Output
    4.2.1 Input Requisites
    4.2.2 Output
  4.3 Sentence Cleaner
  4.4 Sentence Divider
    4.4.1 Coordinate Sentences
    4.4.2 Causal Sentences
    4.4.3 Temporal Context
    4.4.4 Condition
    4.4.5 Belief Modifier
  4.5 Relation Extractor
5 Evaluation
  5.1 Results
  5.2 Problems
  5.3 Discussion
6 Conclusions
  6.1 Contributions
  6.2 Future Work
Bibliography
List of Tables

2.1 Relation phrase corresponding to the sentence "The dinosaurs became extinct."
2.2 POS Tagging of Larry Niven's quote "The dinosaurs became extinct because they didn't have a space program", using the Penn TreeBank tagset. DT: Determiner; NNS: Noun, plural; VBD: Verb, past tense; JJ: Adjective; IN: Preposition; PRP: Personal pronoun; RB: Adverb; VB: Verb, base form; NN: Noun, singular.
2.3 Pattern used by ReVerb to search for relations. Every relation must be matched with one of the rules. ReVerb searches for the longest match it can find, i.e., it gives priority to the rule VP over the rule V.
3.1 Precision and recall values reported at the TAC KBP Slot Filling Challenge 2013.
3.2 Initial results obtained when testing the four extractors with raw dialogue.
3.3 Frequency of different kinds of noise in the incorrect extractions with noise-related problems.
3.4 Frequency and examples of different kinds of subordinations in incorrect extractions originating from long sentences. Conjunctions are marked in bold.
5.1 Number of extractions, precision, recall, and F1-score obtained for the ground truth, and for processed dialogue. The best result in each section is marked in bold.
5.2 Frequency of errors encountered in the XML structures, for each task.
List of Figures

2.1 Dependency parse for the sentence The dinosaurs became extinct because they didn't have a space program. Words are labeled with the respective POS tag and arcs are labeled with the name of the relation. det: Determiner; nsubj: Nominal Subject; xcomp: Open clausal complement; mark: Marker; aux: Auxiliary; neg: Negation Modifier; advcl: Adverbial clause modifier; dobj: Direct object; compound or nn: Noun compound modifier.
2.2 Semantic role labelling of the sentence The dinosaurs became extinct because they didn't have a space program. Words are marked with their respective POS tag, and arcs point to the grammatical function the node has to the head. SBJ: Subject; NMOD: Nominal modification; PRD: Second predicate; PRP: Purposer; SUB: Subordinate; ADV: Adverbial modification; VC: Verb Complement; OBJ: Object.
3.1 Example of an extraction taken from the sentence "I have two dogs", with its confidence value. The first and second arguments are, respectively, I and two dogs, and the relation phrase is have.
3.2 Precision-recall curve for every extractor, along with the Area Under the Curve (AUC).
4.1 Architecture of our application. The three main modules process the data in sequence, while the Output Structure Organizer module works in parallel with them.
4.2 Example of the output XML structure for the sentence "Last year I though it was a good sacrifice to make, because I wanted to spend some time with my kids."
4.3 Example of an interruption being identified and resolved. Each > indicates a different sentence. On the left, the second speaker interrupted the first with a comment. The algorithm identifies that the sentence has fewer than three words, so it removes it and joins the first and third sentences, creating the sentence seen on the right.
5.1 Comparison between the precision-recall curves of each extractor, measured for the ground truth (blue) and for text processed by our application (red).
5.2 Side-by-side comparison of all four precision-recall curves, measured for the dialogue processed by our application.
Chapter 1
Introduction
Open Information Extraction (Open IE) is the task of finding a structured representation for the relations and assertions present in natural language text, without using a predefined categorization of the relations that are to be extracted. Open IE has several purposes, e.g., to extract factual information from a given source [1, 2], or to acquire common sense knowledge [3].
The existing tools show good results when applied to grammatically well-formed text, such as news articles or Wikipedia articles. However, when faced with less structured text, such as dialogue transcriptions, their performance degrades. The grammatical quality of dialogue texts varies significantly, and dialogue is usually prone to interruptions, repetitions and mid-sentence corrections, noise, errors, and other problems.
These problems limit the usefulness of such tools, which is unfortunate because dialogue is a kind of natural language that is present everywhere: on a daily basis, it is the favored way for people to communicate with each other. Hence, the information present in dialogues of many different kinds (e.g., transcriptions of public speeches, or movie subtitles) can be important for several types of applications, such as:
• Extracting information from dialogues in surveillance systems;
• Applications for real-time validation of topics in a group conversation;
• Automatic retention of relevant information in debates;
• Among numerous others.
Current Open IE tools are not prepared to extract useful information from a dialogue. Therefore, the motivation of the research presented in this dissertation was to improve the way that Open IE applications process human dialogue.
1.1 Topic Overview
In this project, we start by studying some of the current Open IE tools, specifically ReVerb [4], OLLIE [5], OpenIE 4 [6], and the Stanford Open IE system (Stanford OIE) [7]. We examine how these tools process dialogue, in comparison to more structured text. Considering that each of these tools uses a different approach, the objective is to understand which one shows the best results for the different kinds of dialogue and their problems.
Secondly, and without changing the algorithms behind those tools, different processing techniques will be applied to the texts in sequential layers, to improve the quality of the data, and the subsequent quality and number of extracted relations.
1.2 Objectives and Contributions
The three main objectives of this thesis were:
1. To provide a comparative evaluation of four Open IE tools when applied to dialogue;
2. To implement a system that improves the quality of relations extracted from dialogue
transcriptions, used in conjunction with Open IE tools;
3. To propose an improved format for the structure of the information extracted.
1.3 Thesis Outline
This document is organized as follows: Chapter 2 introduces the concept of Open Information
Extraction and provides some background into fundamental concepts from the area of Natural
Language Processing. Chapter 3 provides a comparative study between the four extractors,
when applied to dialogue text. This chapter also analyses the dataset that will be used in the
evaluation. Chapter 4 describes the implementation of a system that will attempt to improve the
results of each Open IE tool. Chapter 5 presents the experimental evaluation of our application. Finally, Chapter 6 presents the conclusions of this dissertation and highlights possible paths for future work.
Chapter 2
Background
This chapter presents a summary of the fundamental concepts necessary to understand the following chapters. Section 2.1 introduces the concept of Open IE. Section 2.2 addresses several techniques used in natural language processing. Sections 2.3 and 2.4 introduce the NLP toolkit and the Open IE tools that will be used in this work.
2.1 Open Information Extraction
Open Information Extraction (Open IE) is the task of automatically extracting structured information from text, in the form of relation tuples (a tuple is a finite ordered list of elements). A relation phrase is a phrase expressing a relation between two arguments. In the example shown in Table 2.1, the first argument is The dinosaurs and the second is extinct, while the relation phrase is became.
The dinosaurs became extinct
Argument 1 Relation Phrase Argument 2
Table 2.1: Relation phrase corresponding to the sentence "The dinosaurs became extinct."
Open IE differs from traditional IE by being domain independent: the theme of the text is not known in advance. Traditional IE systems target very specific extractions, using hand-crafted extraction rules, or rules learned by domain-specific supervised methods. Open IE is instead based on creating the extraction rules automatically, making them independent from the context or domain. An Open IE system takes as input a corpus of raw text, ideally makes a single pass over it, and outputs a set of extracted relations [2, 8].
2.2 Natural Language Processing
Natural Language Processing (NLP) is the field concerned with the interaction between humans and computers at the level of human language. Open IE is a subtask of this field, and takes advantage of several of its procedures. The following subsections introduce the most important techniques used by Open IE tools, and by this work.
2.2.1 Tokenisation
Tokenisation is the task of segmenting text into parts, such as words, sentences, or units in between. The process of tokenisation is dependent on the text's language. Usually, punctuation has to be taken into account, and more often than not the segmentation rules depend on the context (for instance, apostrophes are used for more than one purpose). Tokenisation can be achieved using regular expressions, which are robust although somewhat strict, or through machine learning methods, which are more general [9].
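The regular-expression approach can be sketched in a few lines. The rule set below is a minimal, illustrative example (not the tokeniser of any tool discussed later); note that, unlike the Penn TreeBank convention used elsewhere in this chapter, it keeps contractions such as didn't as a single token.

```python
import re

# Illustrative regex tokeniser. Alternatives are ordered so that longer
# units (contractions, ellipses) are matched before single characters.
TOKEN_PATTERN = re.compile(r"""
    \w+(?:'\w+)?   # words, keeping simple contractions together
  | \.\.\.         # ellipsis
  | [.,!?;:"()]    # individual punctuation marks
""", re.VERBOSE)

def tokenise(text):
    """Return the list of tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("The dinosaurs became extinct, didn't they?"))
# → ['The', 'dinosaurs', 'became', 'extinct', ',', "didn't", 'they', '?']
```

A machine-learning tokeniser would instead learn where token boundaries fall from annotated data, which handles ambiguous punctuation more gracefully.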
2.2.2 Part-of-speech Tagging
Part-of-Speech (POS) tagging is the process of assigning a part-of-speech category to each token in a corpus. Parts-of-speech are classes of words, based on their function in a text and on the similarities in their behaviour. For instance, in a simple tagset with 12 classes (i.e., a closed group of possible tags), tokens can be classified as nouns, verbs, adjectives, adverbs, pronouns, determiners or articles, prepositions, numerals, conjunctions, particles, and punctuation marks, as described in [10].
For more complex studies, and for a better understanding and classification of text, larger tagsets can also be used, like the Penn TreeBank tagset, which considers 45 classes [11], or the 61-class C5 tagset¹. Table 2.2 shows an example of a classified sentence.
The dinosaurs became extinct because they did n’t have a space program
DT NNS VBD JJ IN PRP VBD RB VB DT NN NN
Table 2.2: POS Tagging of Larry Niven's quote "The dinosaurs became extinct because they didn't have a space program", using the Penn TreeBank tagset. DT: Determiner; NNS: Noun, plural; VBD: Verb, past tense; JJ: Adjective; IN: Preposition; PRP: Personal pronoun; RB: Adverb; VB: Verb, base form; NN: Noun, singular.
To tag a corpus, there are several algorithms available, based on different methods. Modern approaches mainly leverage rule-based or probabilistic methods. In both cases, the input is a string of words and a tagset, and the output is a single best tag for each word.

¹ http://www.natcorp.ox.ac.uk/docs/c5spec.html
Rule-based methods generally use a two-stage architecture. In the first stage, a tagger assigns all possible parts-of-speech to each word. Then, the tagger applies a large set of constraints to the input sentence to filter out the incorrect parts-of-speech, returning a single POS entry for each word. A good implementation of this approach, using the EngCG² tagger, is described in [12].
Probabilistic methods, instead of using heuristics, select a tag for a word based on probabilities assigned to each tag. These probabilities are computed with a model built through supervised learning. A well-known algorithm based on probabilistic methods is the Hidden Markov Model (HMM) POS tagger. This algorithm first builds all possible tag sequences for a given sentence, and then chooses the sequence with the highest probability. The probability of each tag is computed by multiplying two terms: the probability of a certain tag occurring after another (i.e., a bigram sequence), and the probability of a word being associated with a given tag. Both probabilities are estimated by counting the relevant occurrences in a labeled corpus. The final part of the algorithm, typically named decoding (i.e., choosing the best tag sequence), is done using a dynamic programming algorithm such as Viterbi's [13, 9].
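The decoding step can be sketched as follows. The transition, emission, and start probabilities below form a hypothetical toy model, not estimates from a real corpus; a real tagger would obtain them by counting occurrences in labeled data, exactly as described above.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Toy HMM decoder: returns the most probable tag sequence.
    trans[(t1, t2)]: P(t2 | t1); emit[(t, w)]: P(w | t); start[t]: P(t first).
    Unseen events get a tiny floor probability instead of zero."""
    # best[t] holds (log probability, tag sequence) of the best path ending in t
    best = {t: (math.log(start.get(t, 1e-12)) +
                math.log(emit.get((t, words[0]), 1e-12)), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # pick the predecessor tag that maximises the path score
            p, seq = max(
                (best[prev][0] + math.log(trans.get((prev, t), 1e-12)), best[prev][1])
                for prev in tags)
            new[t] = (p + math.log(emit.get((t, w), 1e-12)), seq + [t])
        best = new
    return max(best.values())[1]

# Hypothetical toy model over three tags.
tags = ["DT", "NN", "VBD"]
start = {"DT": 0.8, "NN": 0.1, "VBD": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VBD"): 0.8, ("VBD", "DT"): 0.5,
         ("NN", "NN"): 0.1, ("DT", "DT"): 0.05}
emit = {("DT", "the"): 0.7, ("NN", "dinosaurs"): 0.4, ("VBD", "became"): 0.3}

print(viterbi(["the", "dinosaurs", "became"], tags, trans, emit, start))
# → ['DT', 'NN', 'VBD']
```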
2.2.3 Noun Phrase Chunking
Chunking, often referred to as shallow parsing, is the technique used to segment and label the tokens of a sentence into subconstituents, such as noun or verb phrases.

Chunking a sentence requires the tokens to first be POS tagged, because the POS labels are used in the patterns for recognizing phrases. Usually, a text chunking process starts by tokenising the sentence, then tags it with a POS tagset, and finally chunks it. There are two main methods to chunk a sentence: using regular expressions, and training chunk parsers.
The first method uses a grammar based on regular expressions, which essentially defines rules using POS tag patterns. For instance, the pattern <DT>? <JJ.*>* <NN.*>+ will chunk sequences starting with an optional determiner, <DT>?, followed by zero or more adjectives, <JJ.*>*, followed by one or more nouns, <NN.*>+. Several rules like this are defined, and the chunk parser searches the text for every chunk it can find, without overlapping with others.
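As a rough sketch of this first method, the matcher below hand-implements the semantics of the pattern <DT>? <JJ.*>* <NN.*>+ over a POS-tagged sentence; toolkits such as NLTK accept this pattern syntax directly through their regular-expression chunkers, so this is only an illustration of what such a rule does.

```python
def np_chunks(tagged):
    """Find NP chunks matching <DT>? <JJ.*>* <NN.*>+ over (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:      # at least one noun: emit the chunk
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:          # no noun reached: no chunk starts here
            i += 1
    return chunks

tagged = [("The", "DT"), ("dinosaurs", "NNS"), ("became", "VBD"),
          ("extinct", "JJ"), ("because", "IN"), ("they", "PRP"),
          ("did", "VBD"), ("n't", "RB"), ("have", "VB"),
          ("a", "DT"), ("space", "NN"), ("program", "NN")]
print(np_chunks(tagged))
# → [['The', 'dinosaurs'], ['a', 'space', 'program']]
```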
The second method uses a large corpus for training a chunk parser. First, a corpus is labelled using a scheme like the IOB format, where each token receives one of three tags: B if it marks the beginning of a chunk, I if the token is inside a chunk, and O if it is outside any chunk. Then, the parser goes through the corpus and tries to learn as many tag patterns as possible, to compare them with the sentences it is working with [14]. The learned patterns rely on lexical items and/or POS tags as features.

² http://ww2.lingsoft.fi/doc/engcg/
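The IOB scheme can be illustrated with a small helper that converts chunk spans into B/I/O labels (a hypothetical utility written for this example; spans follow Python slice conventions):

```python
def to_iob(tagged, chunks):
    """Label each (word, tag) token with B/I/O, given chunk spans (start, end)."""
    labels = ["O"] * len(tagged)
    for start, end in chunks:
        labels[start] = "B"                    # first token of the chunk
        for i in range(start + 1, end):
            labels[i] = "I"                    # remaining tokens of the chunk
    return list(zip([w for w, _ in tagged], labels))

tagged = [("The", "DT"), ("dinosaurs", "NNS"), ("became", "VBD"),
          ("extinct", "JJ")]
print(to_iob(tagged, [(0, 2)]))
# → [('The', 'B'), ('dinosaurs', 'I'), ('became', 'O'), ('extinct', 'O')]
```

A trained chunker performs the inverse mapping: it predicts these labels for unseen sentences and reconstructs the chunk spans from them.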
2.2.4 Dependency Parsing
Dependency parsing is a method for deconstructing sentences into a graph of dependencies
between words, based on a dependency grammar. Dependency grammars are a form of syn-
tactic representation, i.e. a set of rules and criteria that define binary relations between two
words, in the form of dependencies. Dependency relations have a head and a dependent,
which means that there is always a word that has a dependency to another, and this relation
is asymmetric. These relations are achieved by an operation that takes the POS tags from the
two words and returns a more elaborate arrangement of both. The resultant relation is then
labelled by a tag that defines it, and imposes linguistic restrictions on the linked words [15].
Figure 2.1: Dependency parse for the sentence The dinosaurs became extinct because they didn't have a space program. Words are labeled with the respective POS tag and arcs are labeled with the name of the relation. det: Determiner; nsubj: Nominal Subject; xcomp: Open clausal complement; mark: Marker; aux: Auxiliary; neg: Negation Modifier; advcl: Adverbial clause modifier; dobj: Direct object; compound or nn: Noun compound modifier.
Figure 2.1 shows an example of the resulting dependency graph after parsing the sentence
The dinosaurs became extinct because they didn’t have a space program. For each arc, the
word on top is the head, the lower word is the dependent, and the tag in the arc is the name
of the relation. For instance, it is possible to see that the word dinosaurs has a det relation
with the word the. This relation is named determiner and dictates that a determiner (the) must
accompany a noun (dinosaurs) [16].
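A dependency parse becomes easy to manipulate once it is represented as head/dependent pairs. The sketch below encodes the parse of Figure 2.1 as (head, label, dependent) triples and queries it; word indices are omitted for brevity, which is only safe here because no word form repeats in the sentence.

```python
# The dependency parse of Figure 2.1 as (head, label, dependent) triples.
PARSE = [
    ("became", "nsubj", "dinosaurs"),
    ("dinosaurs", "det", "The"),
    ("became", "xcomp", "extinct"),
    ("became", "advcl", "have"),
    ("have", "mark", "because"),
    ("have", "nsubj", "they"),
    ("have", "aux", "did"),
    ("have", "neg", "n't"),
    ("have", "dobj", "program"),
    ("program", "det", "a"),
    ("program", "compound", "space"),
]

def dependents(head, label=None):
    """All dependents of `head`, optionally filtered by relation label."""
    return [d for h, l, d in PARSE if h == head and (label is None or l == label)]

print(dependents("became", "nsubj"))  # subject of the main verb
# → ['dinosaurs']
print(dependents("have"))
# → ['because', 'they', 'did', "n't", 'program']
```

Queries of exactly this kind (find the nsubj of a verb, follow a dobj arc) are what the Open IE tools of Section 2.4 perform over much larger parses.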
There are two main classes of methods for parsing a sentence according to syntactic dependencies: grammar-driven and data-driven. In both, the input is always a sentence and the output is a dependency graph [17].
Grammar-driven methods for dependency parsing rely only on a well-defined formal grammar, which implies that the parser will be more restrictive. Two possible approaches for parsing sentences with grammar-driven methods are: (i) conversion to a context-free grammar, in which dependencies are represented as production rules, and (ii) projection to a constraint satisfaction problem, where the analysis is restricted by a set of constraints present in the grammar.
On the other hand, data-driven methods aim to learn a good predictor of dependency trees based on supervised training, which builds a less robust but more general model. Within data-driven methods, there are two main approaches: (i) transition systems, which at each step of building the dependency graph try to find the highest-scoring dependency in the sequence, and (ii) graph-based systems which, for a given sentence, compute all possible trees according to the model and choose the one with the highest total score [18].
2.2.5 Semantic Role Labeling
The semantic analysis of text refers to the characterization of events that answer questions like
who did what to whom, where, when and how. Semantic Role Labelling (SRL) is the task of
identifying the relations of a target verb to its corresponding participants, i.e., for every verb
there is a pre-specified list of possible semantic roles, and the task of SRL tries to assign each
role to some noun-phrase element in the sentence [19].
A semantic role is a class in linguistic theory, and the constituents of each class vary according to the author. Roles range from extremely specific and verb-dependent to extremely general. For instance, in a more specific definition, the verb attack can have the semantic roles of attacker and victim, and the verb operate can have the roles of doctor and patient, while in a more general definition both verbs instead have the roles of agent and patient [20]. There are three widely accepted definitions of semantic classes, associated with the projects FrameNet [21], PropBank [22], and VerbNet [23].
A typical SRL system is composed of several components, in a modular architecture. First, it has a syntactic parser that is responsible for classifying phrases in the sentence into syntactic categories. Then, it uses a predicate identification and classification component to search for the existing predicates. In Figure 2.2, the predicates are became, did, and have. Next, an argument identification component is used to find the arguments for each predicate. In a last step, the classification of the arguments is done separately from the previous process, using an argument classification component. An implementation of an SRL system following this model is discussed in [22] and [24].
Although dependency parsing and semantic role labelling differ in the structures they map to (dependency parsers map sentences to syntactic dependency graphs, while semantic roles focus on individual predicates), the results are partially overlapping. For instance, the syntactic dependency between two words may coincide with their semantic roles; the similarities are visible in Figures 2.1 and 2.2. This means that it may be possible to use both methods together to improve the accuracy of a classification system that attempts to interpret sentences for information extraction [25].
Figure 2.2: Semantic role labelling of the sentence The dinosaurs became extinct because they didn't have a space program. Words are marked with their respective POS tag, and arcs point to the grammatical function the node has to the head. SBJ: Subject; NMOD: Nominal modification; PRD: Second predicate; PRP: Purposer; SUB: Subordinate; ADV: Adverbial modification; VC: Verb Complement; OBJ: Object.
2.3 Stanford CoreNLP
Stanford CoreNLP is a suite of natural language processing and analysis tools [26]. Among others, it has an annotator for each of the concepts described above. The software also works as an integrated framework: for any desired process, it receives an input text and applies the necessary annotators, layer by layer, to generate the output. For instance, to run the dependency parser, Stanford CoreNLP calls the annotator tokenize, to divide the text into tokens, followed by the annotator ssplit, which splits the text into sentences, and the annotator pos, responsible for applying POS tags, before finally executing the dependency parser annotator.

The dependency parser in particular will be used in Chapter 4. It is a data-driven, transition-based parser, trained using neural networks [27], and it uses the set of dependencies from the Universal Dependencies project [28].
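A common way to use CoreNLP from other languages is through its HTTP server, which accepts the annotator list as a properties parameter and returns JSON. The sketch below assumes a server is already running locally on port 9000 (an assumption for illustration, not part of this work's setup); the subjects helper is then exercised on a trimmed, hand-written response of the shape the server returns, so nothing here requires the server itself.

```python
import json
import urllib.parse
import urllib.request

def annotate(text, annotators="tokenize,ssplit,pos,depparse",
             url="http://localhost:9000"):
    """Send `text` to a locally running CoreNLP server; return parsed JSON."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    req = urllib.request.Request(
        url + "/?properties=" + urllib.parse.quote(props),
        data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def subjects(doc):
    """Extract (verb, subject) pairs from a CoreNLP JSON document."""
    pairs = []
    for sentence in doc.get("sentences", []):
        for dep in sentence.get("basicDependencies", []):
            if dep["dep"] == "nsubj":
                pairs.append((dep["governorGloss"], dep["dependentGloss"]))
    return pairs

# Trimmed, hand-written response of the shape CoreNLP returns, used here
# so the helper can be exercised without a running server.
sample = {"sentences": [{"basicDependencies": [
    {"dep": "nsubj", "governorGloss": "became", "dependentGloss": "dinosaurs"},
    {"dep": "det", "governorGloss": "dinosaurs", "dependentGloss": "The"}]}]}
print(subjects(sample))
# → [('became', 'dinosaurs')]
```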
2.4 Open Information Extraction Tools
The following section describes and compares four Open IE tools, presented chronologically: ReVerb, OLLIE, OpenIE 4, and Stanford OIE.
2.4.1 ReVerb
ReVerb [4] is an extractor that treats relation extraction as a constraint satisfaction problem. Previous extractors were based on classifiers, and produced two frequent types of errors, namely incoherent extractions and uninformative extractions. Incoherent extractions are extractions that have no meaningful interpretation. For instance, the sentence "Genghis Khan's conquest was central to the connection between Asian and European cultures" yields the extraction (was, central, connection), which is incoherent. Uninformative extractions are extractions that omit critical information. The sentence "Joana gave birth to a boy" produces the extraction (Joana, gave, birth to), which is uninformative.
The motivation for solving these errors arises from two intuitive observations: first, a great percentage of the extracted relations follow a simple pattern, meaning that it may be possible to generalize a set of rules for searching a corpus; second, there are two main kinds of extraction errors that occur frequently and may be solved by using constraints.
ReVerb imposes two constraints on the results. The first, a syntactic constraint, defines
a regular expression pattern of POS tags, shown in Table 2.3. Any potential relation must
match one of the patterns. If a relation matches more than one pattern, the longest match
is chosen.

V | VP | VW*P
V = verb particle
W = (noun | adj | adv | pron | det)
P = (prep | particle)

Table 2.3: Patterns used by ReVerb to search for relations. Every relation must match one of the rules. ReVerb searches for the longest match it can find, i.e., it gives priority to the rule VP over the rule V.

This method may sometimes extract overly specific relations that may not
be useful. To solve this problem, ReVerb uses a second, lexical constraint to evaluate whether
an extracted relation is capable of taking other arguments. This is based on the intuition
that a valid relation can take several distinct arguments in a large corpus. To do this, a dictionary
is constructed in an offline step, from a large set of possible relation phrases. The relation being
evaluated is then matched against the dictionary and is only extracted if it exists there.
The dictionary is built by applying the patterns in Table 2.3 to a corpus. The authors then
compute, for each relation phrase, the number of distinct arguments it takes, and only those
with a value above a predetermined threshold are added to the dictionary.
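The syntactic constraint can be approximated by mapping each token's POS tag to one of the class symbols of Table 2.3 and matching a greedy regular expression over the resulting string, so the longest phrase wins. The tag-to-class mapping below is a simplified assumption for illustration, not ReVerb's exact table.

```python
import re

# Simplified mapping from Penn Treebank POS tags to Table 2.3 symbols.
TAG_CLASS = {
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "NN": "W", "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "TO": "P", "RP": "P",
}

def longest_relation(tags):
    """Return the (start, end) token span of the longest relation phrase,
    or None when no tag sequence matches V | VP | VW*P."""
    symbols = "".join(TAG_CLASS.get(t, "X") for t in tags)
    best = None
    # V(?:W*P)? covers all three alternatives; greedy matching
    # prefers the longest span at each position.
    for m in re.finditer(r"V(?:W*P)?", symbols):
        if best is None or len(m.group()) > len(best.group()):
            best = m
    return (best.start(), best.end()) if best else None

# "Joana gave birth to a boy": VBD NN TO covers "gave birth to"
print(longest_relation(["NNP", "VBD", "NN", "TO", "DT", "NN"]))  # (1, 4)
```

Note how the VW*P alternative is what lets the phrase keep "birth to", avoiding the uninformative extraction (Joana, gave, birth to) discussed above.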
To extract relations, ReVerb first POS-tags and NP-chunks the corpus being evaluated.
Then, it uses the above constraints to find and validate relation phrases. If two relations are
contiguous, they are merged. Next, it searches for the relation's arguments: it is necessary to
find the left and the right arguments, and the boundaries of each. ReVerb uses heuristics
based on POS tags to determine them. If a valid relation and both arguments are found, the
system returns an extraction. In a final step, ReVerb uses a confidence function, based on a
logistic regression learner, to assign a confidence value to the extraction.
2.4.2 OLLIE
OLLIE [5] is an evolution of ReVerb, addressing its main flaw: the system can only detect
relations mediated by verb patterns. OLLIE also recognizes that some relations are not
factual (e.g., I think that Portugal is a Spanish city), but are instead based on beliefs,
hypothetical contexts, etc.
To solve these limitations, OLLIE expands the syntactic scope of possible relation phrases,
to cover a much larger number of relation expressions. It changes the Open IE representation
to allow additional context information, such as clausal modifiers.
OLLIE’s algorithm is divided into a learning phase and an extraction phase. To learn which
relation patterns are used, OLLIE bootstraps a large quantity of seed tuples from ReVerb,
under the assumption that almost all relations can be expressed in a verb-like way. For instance,
the sentence “The actor John Malkovich (...)” produces the relation (John Malkovich; is; an
actor). Then, OLLIE uses the seed tuples to learn patterns over dependency parses; these
patterns often appear in a verb-mediated form, although that is not the case in this particular
example. It then learns open pattern templates, i.e., mappings from a dependency path to an
open extraction, identifying both the arguments and the exact ReVerb-style relation phrase.
These patterns are ranked by frequency. At extraction time, OLLIE simply builds a dependency
parse for the sentence and expands the nodes as far as possible to match the previously
learned patterns. In the extraction process, there is an additional step that analyzes the context of
the relations, to solve the problem of non-factual relations. OLLIE uses the dependency parse
structure to analyze key nodes. Words like believe, imagine, or if are flagged and analyzed
and, when appropriate, OLLIE attaches a clausal modifier to make the extraction factual.
When OLLIE cannot establish the veracity of the sentence in this way, it simply assigns the
extraction a low confidence.
2.4.3 OpenIE 4
The last iteration of the KnowItAll Project (to which ReVerb and OLLIE belong) is OpenIE 4 [6].
The biggest change introduced by this system was to break with the old model of binary relations,
enabling n-ary extractions from a sentence. N-ary extractions can have one or more second
arguments. For instance, the sentence “The U.S. president Barack Obama gave his speech on
Tuesday to thousands of people” can be converted into the extraction (Barack Obama, gave,
[his speech, on Tuesday, to thousands of people]).
OpenIE 4 works by using two different main components: SRLIE and RelNoun. The first
component, SRLIE, is responsible for identifying n-ary extractions. It builds extractions
from Semantic Role Labelling (SRL) frames. To do this, it runs each sentence through a
dependency parser, then feeds the resulting dependency graph to an SRL system to
produce the SRL frames. Finally, SRLIE turns these frames into n-ary extractions:
first it heuristically filters out some SRL frames, then it determines the arguments'
boundaries, and finally it constructs the relation phrases. The second component,
RelNoun, is used to extract relations from noun phrases, like Queen vocalist Freddie Mercury.
OpenIE 4 also further improves OLLIE's clausal modifiers, taking into account conditional
sentences, and checking whether a sentence makes a negative or positive assertion.
2.4.4 Stanford Open Information Extraction
The Stanford OIE system [7], published in 2015, is the most recent of all the tools analyzed
in this section. This system differs from the others in its methodology: instead of trying
to learn increasingly complex patterns to recognize increasingly complex relations, its main
focus is to first simplify the sentences, so that in the end only a small number of patterns is
needed to recognize most of the relations. In a
first step, it preprocesses the sentences to produce smaller, independent and coherent clauses
(i.e., clauses that can stand on their own syntactically and semantically), that are entailed by the
main sentence. To accomplish this, the system traverses a dependency parsing tree, labelling
each arc with one of three actions that indicate the steps that must be taken for the arc and
correspondent subtree. These actions can be:
• Stop: if the subtree below the current arc is not entailed by the parent sentence (the
common action for leaf nodes, for instance).
• Yield: if the system found a new clause that is entailed by the parent sentence.
• Recurse: indicates that the subtree may contain a clause, but one has not been found
yet.
Additionally, sometimes there is a relation between the parent and the child clause (only
when the action is yield or recurse), and three additional actions can be taken to express
that relation:
• Subject Controller: if the current arc is not already a subject arc, the subject of the
parent node is copied and attached as the subject of the child node.
• Object Controller: does the same as the previous action but takes the parent node as
the object instead.
• Parent Subject: if the current arc is the only outgoing arc from a node, the parent node
is assigned as the passive subject of the child.
Using this algorithm, the system trains a multinomial logistic regression classifier on a
labeled dataset, learning the sequences of actions that produce relations by applying the
algorithm to that dataset. Action sequences whose extracted relations match the labeled
relations are used as positive examples, and sequences that correctly produce no
relations as negative examples.
In a second step, the system tries to use the produced clauses (after running the classifier
on the corpus under evaluation) to generate a maximally compact sentence that retains the core
semantics of the original, i.e., the smallest possible sentence that is entailed by the
parent sentence and is semantically and syntactically correct. To do this, the system uses
rules of Natural Logic described in [29] to check what it can delete from the clauses. This
is done by checking if certain words induce downward or upward polarity (words such as all,
no, many, etc.). If a word like all induces a more general scope for the sentence (e.g., if All
dogs bark, then it is true that All small dogs bark), the word all is said to induce
downward polarity. If, on the other hand, a word like some induces a more specific scope (e.g.,
if Some yellow flowers smell good, then it is possible to say that Some flowers smell good),
the word some is said to induce upward polarity. The algorithm searches for these
words, classifies each arc according to whether deleting the child of the arc makes the parent
more general, more specific, or neither, and acts accordingly, using constraints and heuristics
to shorten the sentences that can be shortened.
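A toy sketch of this deletion check, using a tiny, hand-picked quantifier lexicon (an assumption for illustration, not the actual lexicon of [29]): deleting a modifier generalizes the clause, and that is truth-preserving only under upward polarity.

```python
# Assumed toy lexicon: monotonicity of a few quantifiers in their
# restrictor argument (illustrative only).
POLARITY = {"all": "downward", "every": "downward", "no": "downward",
            "some": "upward", "many": "upward"}

def deletion_is_safe(quantifier):
    """Deleting a modifier under the quantifier makes the clause more
    general; this preserves truth only under upward polarity
    ("Some yellow flowers smell good" entails "Some flowers smell good",
    but "All small dogs bark" does not entail "All dogs bark")."""
    return POLARITY.get(quantifier.lower()) == "upward"

print(deletion_is_safe("some"), deletion_is_safe("all"))  # True False
```

The real system applies this check per dependency arc rather than per quantifier word, but the monotonicity logic is the same.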
Finally, in a third step, the system uses six main patterns to extract relations from the sen-
tences. If the system detects a compound noun (i.e., a noun phrase that may hold a relation),
it uses eight more patterns to try to extract that relation [7].
Chapter 3
Introductory Analysis
This chapter presents an introductory study of the four Open IE tools described in the
previous chapter, and an analysis of transcribed dialogue text. OpenIE 4 and Stanford OIE are
the state-of-the-art extractors; ReVerb and OLLIE are older systems. Nevertheless, there are
critical differences in the algorithms used by each system. Considering that there is still a lack
of studies of Open IE applied to dialogues, it is possible that those critical differences have
an impact on the extractions, hence our decision to compare the four tools.
The chapter starts with the study and construction of the dataset that we will use. Section
3.2 follows with an explanation of the evaluation method and a description of the metrics used.
Section 3.3 details the context in which each tool was evaluated, and establishes a benchmark
for further studies by showing the results of the four tools when applied to our dataset. In
Section 3.4, we analyze the problems observed in the initial study. Finally, in Section 3.5,
we present the characteristics of the system we propose to address the problems encountered.
3.1 Dataset Study
The dataset used to carry out this study is the Switchboard Dialogue Act Corpus [30]. This
dataset contains a collection of 1157 two-sided telephone conversations, among 543 speakers.
The conversations were transcribed and annotated; the documents carry tags that identify
speech acts and phenomena such as sounds, disfluencies, and repetitions.
To perform our studies, we chose 8 conversations at random. Open IE systems usually work
at a sentence level, i.e., they analyze one sentence at a time in search for relations. In order
to comply with that methodology, each file was pre-processed to remove major annotations,
and to be structured by sentence. This process generated 500 proper sentences. Chapter 4
explains the pre-processing algorithm in further detail.
We manually labeled the sentences by creating a structure in which, for each sentence, we
identified the possible relations and, for each relation, separately tagged the first argument,
the relation phrase, and the second argument. Some sentences did not contain information
from which to produce an extraction, e.g., “No way, John!”. These sentences were kept, as
they are part of the dataset, and were annotated with no relations. In this study we also did not
consider questions, because none of the four Open IE tools is prepared to process the
grammatical structure of a question. To execute our tests, we provided the unlabeled set of
500 sentences to each extractor, and compared the results with the labeled set.
3.2 Evaluation Method
To properly conduct a comparative experiment between all four Open IE applications, it is
necessary to understand what kind of information they produce.
Figure 3.1 shows an example of an extracted relation and its confidence value. For an
extraction to be valid, it must have both arguments of the relation, the relation phrase, and a
confidence value. It is possible, and in fact frequent, to have more than one extraction per
sentence, each with its own confidence value.
0.876 (I ; have ; two dogs)
Figure 3.1: Example of an extraction taken from the sentence “I have two dogs”, with its confidence value. The first and second arguments are, respectively, I and two dogs, and the relation phrase is have.
The confidence value is used to validate the extracted relation in unsupervised tests. It
varies between 0 and 1 and is computed by a classifier. The classifier is specific to each
application and is trained with a set of features related to the method each tool uses. For
instance, since ReVerb relies on syntactic patterns to recognize relations, it uses features
such as the number of words and the presence of specific conjunctions, while Stanford OIE
takes features from the dependency tree it uses to perform the extraction, such as the labels
of the edges taken and the POS tags of the edge endpoints.
We classified each relation extracted either as correct or incorrect. A relation is considered
correct if it satisfies the following criteria:
• It must be a valid extraction, according to the definition presented previously;
• It must be syntactically and semantically correct;
• It must be informative, i.e., it must contain all the necessary information related to the
event extracted. For example, the extraction (Peter; went; to look), taken from the sentence
“Peter went to look for a house”, is not informative and is thus classified as incorrect,
whereas the extraction (Peter; went to look; for a house) is classified as correct;
• It must be locally factual, i.e., it must not contradict or change the information present in
the sentence. For instance, if the sentence “The moon is a star ” produces the extraction
(The moon; is; a star), this extraction is correct, according to the information present in
the sentence. However, the extraction (The earth; was; flat), taken from the sentence
“Ancient cultures believed that the earth was flat”, is not correct, since that is not the
information stated in the sentence.
If an extraction does not comply with all the above criteria, it is classified as incorrect.
There may be several correct extractions for the same relation, differing only by one or two
words. We also counted the number of possible relations for which no correct extraction was
produced.
For a classification task such as this, there are four kinds of possible results:
• True Positives (TP): extractions that were classified as correct.
• True Negatives (TN): the absence of extractions when none is expected. We do not
count those results here.
• False Positives (FP): extractions that were classified as incorrect.
• False Negatives (FN): sentences for which there should have been at least one extraction,
but there was none.
By using these values, we computed the metrics of precision, recall, and F1-score. Precision
is the percentage of extractions that were classified as correct, out of all the extractions
obtained:

Precision = TP / (TP + FP)

Recall is the percentage of correct extractions obtained from the universe of all possible
correct extractions in the dataset:

Recall = TP / (TP + FN)
F1-score is the harmonic mean of precision and recall, measuring the overall performance
of the system:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
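The three metrics follow directly from the counts above; a small helper, shown with hypothetical counts for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positives, false
    positives, and false negatives (true negatives are not counted
    in this study)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts, for illustration only:
p, r, f = precision_recall_f1(tp=30, fp=70, fn=170)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.3 0.15 0.2
```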
The confidence value can be used in unsupervised tests to self-validate an extraction. It
is possible to establish a division point by defining a threshold, e.g., we can predict that all
extractions with confidence above 0.5 are correct, whether or not that is true. By varying this
threshold, it is possible to increase one of the metrics to the detriment of the other, e.g., by
lowering the threshold, we can increase recall at the cost of lower precision. To measure this
trade-off, we plot a Precision-Recall curve, computed from the confidence value and the
binary classification of each extraction. With this curve plotted, we can calculate the Area
Under the Curve (AUC) [31], which measures the space in the graph below the curve. The
AUC varies between 0 and 1. A higher AUC denotes a better trade-off, i.e., more extractions
are classified as correct at progressively lower confidence values.
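The curve can be sketched as follows: sort the extractions by descending confidence, lower the threshold one extraction at a time, and approximate the area with the trapezoidal rule. This is a simplification for illustration, not necessarily the exact AUC computation of [31].

```python
def pr_points(extractions, n_gold):
    """extractions: (confidence, is_correct) pairs; n_gold: number of
    correct extractions expected in the labeled data. Returns the
    (recall, precision) point obtained at each threshold."""
    ranked = sorted(extractions, key=lambda x: -x[0])
    points, tp = [], 0
    for i, (conf, correct) in enumerate(ranked, start=1):
        tp += int(correct)
        points.append((tp / n_gold, tp / i))
    return points

def auc(points):
    # Trapezoidal area under the (recall, precision) points,
    # starting from (recall=0, precision=1).
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# Hypothetical extractions, for illustration:
pts = pr_points([(0.9, True), (0.8, True), (0.6, False), (0.4, True)], n_gold=5)
print(round(auc(pts), 3))  # 0.542
```

Lowering the threshold admits the 0.6-confidence incorrect extraction before the 0.4-confidence correct one, which is exactly the precision-for-recall trade-off described above.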
3.3 Open IE: A Comparative Study
This section provides a comparative study between ReVerb, OLLIE, Stanford OIE and OpenIE
4. First, it describes the context in which the extractors were evaluated, as well as the results
they reported. Then, it applies the four tools to the test dataset, and analyzes the obtained
results, creating a ground truth for further tests.
3.3.1 Reported Results
In the original publications, both ReVerb and OLLIE were evaluated by human annotators.
ReVerb took extractions from a set of 500 sentences, and OLLIE used a set of 300 sentences.
In both cases, two human annotators classified each extraction as correct or incorrect, and
the final evaluations were taken from the extractions they agreed upon. ReVerb reported a
precision of 0.75, but no value for recall. OLLIE reported a precision of 0.75 and a recall of 0.5.
OLLIE also stated that it was able to extract 4.4 times more correct relations than ReVerb, and
that it achieved an AUC 2.7 times higher [4, 5].
Stanford OIE and OpenIE 4 took a different approach. Both systems participated in the Text
Analysis Conference: Knowledge Base Population (TAC KBP) Slot Filling Challenge 2013¹,
which offers, among other challenges, a task to fill a predefined schema of relations, extracted
from a large dataset. The relations are evaluated based on a set of query entities. OLLIE, the
state-of-the-art system at that time, was also used in the challenge, for comparison. The results
from the TAC KBP Challenge are presented in Table 3.1 [7, 6].

¹ http://tac.nist.gov/2013/KBP/
              Precision   Recall
OLLIE           0.577     0.118
Stanford OIE    0.586     0.186
OpenIE 4        0.698     0.114
Table 3.1: Precision and Recall values reported at the TAC KBP Slot Filling Challenge 2013.
From these results, we can observe that OpenIE 4 is currently the system with the highest
precision, while Stanford OIE displays the highest recall.
3.3.2 Benchmarks
The conditions and the dataset used in the TAC KBP challenge are different from our dataset,
and from the extraction of dialogue transcriptions that we intend to study. As a consequence,
the results shown in Table 3.1 are not directly comparable with our evaluation, and should be
used only as an extrinsic reference. It is therefore necessary to establish a ground truth, i.e.,
a baseline of results obtained by applying the four extractors to our dataset, before conducting
further studies.
Table 3.2 shows the precision, recall, and F1-score obtained with the first tests, along with
the number of extractions. If we take into account the results previously reported, both precision
and recall denote a low score overall.
              Extractions   Precision   Recall   F1-score
ReVerb             760        0.36       0.12      0.18
OLLIE             1030        0.37       0.17      0.23
Stanford OIE      1415        0.27       0.24      0.25
OpenIE 4          1123        0.42       0.15      0.22
Table 3.2: Initial results obtained when testing the four extractors with raw dialogue.
ReVerb is the system with the fewest extractions, and the lowest recall and F1-score. OpenIE
4 has the highest precision, yet it shows the second lowest values for recall and F1-score.
OLLIE shows balanced results compared with the other tools. Stanford OIE has the highest
recall and the highest number of relations extracted, despite showing the lowest precision. We
observed that this larger number of extractions causes Stanford OIE to have a larger percentage
of incorrect extractions, thus lowering its precision. Yet, at the same time, it produces more
correct extractions than the others, increasing its recall.
[Precision-Recall plot: ReVerb AUC = 0.081; OLLIE AUC = 0.154; Stanford OIE AUC = 0.152; OpenIE 4 AUC = 0.164]

Figure 3.2: Precision-Recall curve for every extractor, along with the Area Under the Curve (AUC).
Figure 3.2 displays a comparison between the Precision-Recall curves of the four Open IE
systems, together with the respective AUC. The curves show the values of precision
and recall as the threshold on the confidence value varies. For instance, in OLLIE's curve
(in red), all extractions with a confidence value above 0.85 were classified as correct, hence
the curve is a straight line with precision equal to 1. However, all the other extractions, with
lower confidence, were considered incorrect, hence the recall of 0.1, as only a small
percentage of all the possible extractions was considered correct. By progressively lowering
the threshold, we can see an improvement in recall in exchange for a decrease in precision.
This trade-off is observed in the evolution of the curves, with a higher curve implying a better
trade-off. We quantify the trade-off by measuring the AUC, i.e., the size of the area below
the curve.
OpenIE 4 shows the best trade-off, with an AUC of 0.164. We can also observe that the
decline of its curve is regular, denoting that there are progressively more incorrect extractions
as the confidence is reduced, instead of a sudden lack of correct extractions at a tipping point.
The other three extractors show precisely that situation. OLLIE, for instance, has a flat line until
a confidence value of approximately 0.85, declining suddenly after that point. ReVerb has a
decrease early in the curve, showing that a large number of extractions classified as incorrect
had a high confidence value. It also has the lowest AUC, less than half of OpenIE 4's.
Stanford OIE has a local minimum early in the curve. We observed that this situation is
influenced by Stanford OIE producing more extractions than the other three systems, as in this
case it was producing multiple instances of incorrect relations with high confidence.
The results presented in this section constitute the benchmark for our subsequent studies
and experimentations.
3.4 Dataset Problems
In order to understand the results obtained by each extractor, we analyzed the incorrect
extractions and studied the characteristics of the sentences from which they were generated.
A preliminary observation shows that sentence length is one of the major problems:
longer sentences generated more incorrect extractions. Likewise, a large percentage of
incorrect extractions came from sentences with a bad structure or with noise, e.g.,
sentences showing repetitions, interruptions, changes of topic mid-sentence, or grammatical
malformations.
A more thorough analysis was conducted to understand and define the kinds of problems
encountered, and to measure their frequency. From the preliminary analysis, we consider
two types of problems: noise and sentence length. 32% of the extractions classified as incorrect
had noise-related problems, while 59% originated from long sentences.
Noise
We define Noise as grammatical malformations, or expressions that add no value to the
sentence. For instance, a sentence that has a repetition is grammatically malformed. On the
other hand, in the sentence “We went to the store and stuff”, the expression and stuff adds
no value to the information present in the sentence. Table 3.3 summarizes the most frequent
kinds of noise observed in the extractions with noise related errors. It is possible to see that
repetitions occur frequently, being present in 43% of those incorrect extractions.
Regarding interruptions, we noticed that they fall into two cases. The first is an
interruption consisting of a commentary, or an agreement sound, by one speaker while
the other is talking (e.g., I see, hm-hmm). These interruptions usually do not shift the theme
of the interrupted sentence, i.e., the first speaker continues to talk about the original subject.
The second kind of interruption consists of a shift of subject, e.g., “(...) and then we went to the
boat and... Hey, look at the time! We must be going“. While the first case may provide a
complete sentence with sufficient information for an extraction, the second case leaves an
incomplete sentence, from which it usually will not be possible to retrieve a relation.

Noise                     Frequency   Example
Repetitions                  43%      ”and then, then he...”
Meaningless expressions      28%      ”and stuff like that”
Interruptions                16%      ”By the time we...” / ”What about your car?”
Others                       13%      ”We are, you see, in spite of.”

Table 3.3: Frequency of different kinds of noise in the incorrect extractions with noise-related problems.
Long Sentences
We observed that the long sentences originating incorrect extractions were sentences
with more than one clause. Grammatically, those sentences fall into two groups:
• Compound Sentences: defined as having two independent clauses, i.e., clauses that
can stand alone syntactically (e.g., “Peter went to the mall and Maria stayed home.”)
• Complex Sentences: defined as having an independent clause and at least one dependent
clause, i.e., a clause that has no meaning if stated alone (e.g., “Peter went to the mall in the
afternoon.”)
Compound sentences are connected by a coordinating conjunction (e.g., and, for, so, yet).
Complex sentences are usually connected by a subordinating conjunction (e.g., when, because,
although, after ). Table 3.4 summarizes the most frequent kinds of subordination found in the
incorrect extractions caused by long sentences. The subordination type can be recognized by
the conjunction, marked in bold in the table.
Subordinate Type   Frequency   Example
Temporal              39%      ”I went home when the rain stopped.”
Causal                27%      ”My mother went on vacations because she needed a rest.“
Conditional            9%      ”Maria will be on time if there’s no traffic.”
Comparative            5%      ”She earns as much money as I do”
Others                13%      ”Even though he studied a lot, he didn’t pass the exam.”

Table 3.4: Frequency and examples of different kinds of subordinations in incorrect extractions originated from long sentences. Conjunctions are marked in bold.
It is possible to observe a clear prevalence of two types of subordination: Temporal
and Causal. The Others category includes types with lower frequency, and types that are
ambiguous or hard to identify.
Compound sentences commonly have two relations in them. In the example “Peter went to
the mall and Maria stayed home.“, we can extract the relations (Peter; went to; the mall) and
(Maria; stayed; home). Complex sentences mostly only have one event, and thus one relation.
In the example “Peter went to the mall in the afternoon.”, the relation (Peter; went to; the mall)
is still correct. However, the relation (Peter; went to; the mall in the afternoon) is also correct,
as it does not contradict any of our criteria. This ambiguity in the definition of the arguments is
one of the causes of errors in the extractions.
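The two-relation structure of compound sentences suggests a simple split at the coordinating conjunction; the sketch below illustrates the idea under the naive assumption of a single top-level conjunction and no conjunctions inside the clauses.

```python
# Coordinating conjunctions used as naive split points (an assumed,
# incomplete list for illustration).
COORDINATORS = (" and ", " but ", " so ", " yet ")

def split_compound(sentence):
    """Split a compound sentence into two subsentences at the first
    top-level coordinating conjunction found."""
    for conj in COORDINATORS:
        if conj in sentence:
            left, right = sentence.split(conj, 1)
            right = right[0].upper() + right[1:]
            if not left.endswith("."):
                left += "."
            return [left, right]
    return [sentence]

print(split_compound("Peter went to the mall and Maria stayed home."))
# ['Peter went to the mall.', 'Maria stayed home.']
```

Each resulting subsentence can then be handed to an extractor on its own, which is exactly the simplification strategy pursued in Section 3.5.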
Other Problems
Finally, within the remaining 9% of incorrect extractions, a frequent occurrence was related
to local factual validity, i.e., extractions where the generated relation does not correspond
to the truth present in the sentence (as seen in the example “Ancient cultures believed that
the earth was flat”). In this case, the sentence contains an expression that we call a Belief
Modifier, which changes the validity of the event in the sentence. In the previous example,
the belief modifier is “Ancient cultures believed”.
To summarize this study, by analyzing the extractions classified as incorrect, we identified the
following problems in the respective sentences:
• Sentences with noise;
• Long sentences;
• Contradictions between the relation and the sentence.
3.5 Strategy Definition
Based on the previous study, it is now necessary to develop a strategy to solve those
problems.
In order to decrease the noise, our strategy will be to remove the occurrences of repetitions
and meaningless expressions. As for interruptions, we will treat only the first case, where the
theme of the sentence continues after the interruption, and remove the interrupting expression.
The second case is usually not possible to solve, as the information needed to complete the
interrupted sentence does not exist. With regard to errors in long sentences, our strategy will
be to simplify those instances.
In the case of compound sentences, we will split the clauses and create two new subsen-
tences, considering that each clause is independent and grammatically correct by itself.
As for complex sentences, we will isolate the main event of the sentence and retrieve the
complementary information given by the dependent clauses. This information may nonetheless
be important to the relation, and even necessary to validate it. Therefore, to keep it, we will
implement a structure capable of storing both the relation and the complementary information.
The rationale is that, by removing these dependent clauses from the sentences beforehand,
the resulting simplified sentences will be easier for the extractors to process. We will focus
our implementation on the three most frequent types of clauses observed in the study above:
Temporal, Causal, and Conditional. Conveniently, this solution also addresses the third
problem, corresponding to Belief Modifiers.
This strategy is based on mechanisms that some of the Open IE tools already employ:
Stanford OIE extracts its relations by first simplifying the sentences, while OLLIE and
OpenIE 4 extract some complementary data alongside the relation.
Chapter 4
Implementation
This chapter describes the implementation of the system proposed to improve the results of
all four tools, based on the study presented in the previous chapter. The chapter starts with an
explanation of the objectives of the application in Section 4.1, followed by the definition of the
input requisites of the system and of the output format in Section 4.2. Sections
4.3 to 4.5 present a detailed description of every module implemented.
4.1 Overview
As studied in the previous chapter, the major problems observed in our dialogue dataset are
noise, sentence length, and contradictions between the relations and the sentences. In Section
3.5 we defined that the strategy to solve these problems is to remove the noise; to simplify the
sentences; and to store complementary information into an organized structure. As such, our
system is divided into the following five modules:
• Pre-Processor: tasked with adapting the dataset to comply with the input requisites of
the system. It is external to the main pipeline and should be changed in case another
dataset is to be evaluated;
• Sentence Cleaner: used to clean the input data from noise and irrelevant information;
• Sentence Divider: concerned with dividing sentences into subsentences, and with simplify-
ing them by retrieving complementary information;
• Relation Extractor: responsible for setting and calling the four extractors used in the final
step of the pipeline;
• Output Structure Organizer: a module whose task is to store the information generated
in the output of the different Open IE tools into an XML structure.
The system, shown in Figure 4.1, works as follows: a pre-processed file is provided to the
software. The data passes through the modules sequentially, first cleaning each sentence, then
dividing and simplifying it, and finally extracting its relations. The output organizer works in
parallel, structuring the information as it is processed. The output data is the XML structure
produced by the Output Structure Organizer.
Figure 4.1: Architecture of our application. The three main modules process the data in sequence, while the Output Structure Organizer module works in parallel with them.
4.2 Input Requirements and Output
Before describing the implementation of our system, it is important to define what data can be
used as input, and how the output should be formatted. This section defines the requirements and
the output of our application.
4.2.1 Input Requirements
Every Open IE tool analyzed operates at the sentence level, which means that all
four Open IE tools extract relations for one sentence at a time.
In this context, a sentence is defined as a set of one or more words, from a single speaker,
written with an initial capital letter and ending punctuation. No characters or
annotations outside of the context of the transcription are allowed (e.g., an exclamation mark
is part of the context of the speech, but a pair of brackets indicating repetition is not).
Therefore, for data to be accepted by this system, it must be arranged in a sentence-based
structure, i.e., each input file must have a single sentence per line. The subsequent description
of the system is done at the sentence level.
Switchboard is an annotated transcription of dialogues, which means that it is labeled with
meta-information used to indicate details such as utterance indexes or existing disfluencies.
These labels are not necessary for our application, hence they were removed.
The next step is to normalize the sentences, by capitalizing every first word and adding a
full stop where final punctuation is missing. This step is important because some
extractors use these tokens to aid the recognition process. Nevertheless, we only execute this
step at this point in tests that do not use the Sentence Cleaner module. Otherwise, we normalize
the sentences only after the Sentence Cleaner, because the lack of final punctuation is crucial in
recognizing interruptions.
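A minimal sketch of this normalization step (the function name is ours):

```python
import re

def normalize(sentence: str) -> str:
    """Capitalize the first word and append a full stop
    when final punctuation is missing."""
    sentence = sentence.strip()
    if not sentence:
        return sentence
    sentence = sentence[0].upper() + sentence[1:]
    if not re.search(r'[.!?]$', sentence):
        sentence += '.'
    return sentence
```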
4.2.2 Output
In order to store the complementary information that will be retrieved from the sentences before
the extraction, we created the structure shown in Figure 4.2.
<relations file="file-name">
  <sentence id="12">
    <text>Last year I thought it was a good sacrifice to make, because I wanted to spend some time with my kids.</text>
    <simplification id="0">
      <text>It was a good sacrifice to make.</text>
      <TC>Last year</TC>
      <BM>I thought</BM>
      <extraction id="0" confidence="0.756" classification="1">
        <arg1>it</arg1>
        <rel>was</rel>
        <arg2>a good sacrifice to make</arg2>
      </extraction>
    </simplification>
    <simplification id="1">
      <text>I wanted to spend some time with my kids.</text>
      <DC>Because</DC>
      <extraction id="0" confidence="0.802" classification="1">
        <arg1>I</arg1>
        <rel>wanted to spend</rel>
        <arg2>some time with my kids</arg2>
      </extraction>
    </simplification>
  </sentence>
</relations>
Figure 4.2: Example of the output XML structure for the sentence "Last year I thought it was a good sacrifice to make, because I wanted to spend some time with my kids."
This structure consists of an XML file produced for every file evaluated, for each tool used.
Each XML file will have a list of sentences. The sentence stated is the original, taken from the
pre-processed file. After a sentence goes through each module, every subsentence created
is inserted into a simplification element, and if any secondary elements are retrieved,
they are inserted in the respective entries. The four possible elements are: TC for Temporal
Context; BM for Belief Modifier; DC for Direct Consequence; and CC for Condition. Later, for
each extraction found, an extraction element is created, containing the three elements of
a relation, i.e., arg1, rel, and arg2, as well as the confidence value obtained by the tool,
and a binary classification value used for evaluation. It is possible for a simplification to
have no extractions, in which case no extraction element is created. If a sentence has no
simplifications, a simplification element is created with the text being equal to the sentence.
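The structure above can be built incrementally with the standard library, for instance as follows (a sketch; the helper function names are ours, not those of the actual implementation):

```python
import xml.etree.ElementTree as ET

def build_structure(filename: str) -> ET.Element:
    """Root element of the output structure for one evaluated file."""
    return ET.Element('relations', {'file': filename})

def add_sentence(root, sent_id, text):
    sent = ET.SubElement(root, 'sentence', {'id': str(sent_id)})
    ET.SubElement(sent, 'text').text = text
    return sent

def add_simplification(sent, simp_id, text, complements=None):
    simp = ET.SubElement(sent, 'simplification', {'id': str(simp_id)})
    ET.SubElement(simp, 'text').text = text
    # complements maps a tag (TC, BM, DC, CC) to the extracted phrase.
    for tag, phrase in (complements or {}).items():
        ET.SubElement(simp, tag).text = phrase
    return simp

def add_extraction(simp, ext_id, confidence, arg1, rel, arg2, classification='0'):
    ext = ET.SubElement(simp, 'extraction',
                        {'id': str(ext_id), 'confidence': str(confidence),
                         'classification': classification})
    ET.SubElement(ext, 'arg1').text = arg1
    ET.SubElement(ext, 'rel').text = rel
    ET.SubElement(ext, 'arg2').text = arg2
    return ext
```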
4.3 Sentence Cleaner
Sentence Cleaner is the module responsible for removing the noise. We identified noise as
repetitions, meaningless expressions, and interruptions.
Repetitions are removed with a regular expression. If the module identifies two equal words,
one after the other, with a possible punctuation mark in between, it removes one of the
occurrences.
Meaningless expressions are identified, and consequently removed, by using a dictionary.
We manually compiled this dictionary using every meaningless expression found in the dataset,
together with possible variations. This dictionary contains expressions such as "and everything",
"or something", and "that's good".
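The two cleaning steps above can be sketched as follows (the regular expression and the dictionary subset are illustrative, not the exact ones used):

```python
import re

# Illustrative subset of the manually compiled dictionary.
MEANINGLESS = ["and everything", "or something", "that's good"]

# A word, an optional punctuation mark, whitespace, then the same word again.
_REPEAT = re.compile(r'\b(\w+)\b[,.!?]?\s+\1\b', flags=re.IGNORECASE)

def _tidy(sentence: str) -> str:
    """Collapse double spaces and spaces left before punctuation."""
    sentence = re.sub(r'\s{2,}', ' ', sentence)
    return re.sub(r'\s+([,.!?])', r'\1', sentence).strip()

def remove_repetitions(sentence: str) -> str:
    """Drop one occurrence of an immediately repeated word."""
    prev = None
    while prev != sentence:            # repeat until stable
        prev = sentence
        sentence = _REPEAT.sub(r'\1', sentence)
    return sentence

def remove_meaningless(sentence: str) -> str:
    """Delete every dictionary expression found in the sentence."""
    for expr in MEANINGLESS:
        sentence = re.sub(re.escape(expr), '', sentence, flags=re.IGNORECASE)
    return _tidy(sentence)
```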
Interruptions are recognized by the absence of final punctuation in a sentence. We restate
that there are two kinds of interruptions: the case where the theme of the sentence is not
shifted, as when an agreement sound is made; and the case where the theme shifts and the
sentence becomes incomplete. We only act upon the first case. To identify the correct cases,
the module searches for lines that do not end with any final punctuation. Then, it evaluates the
next sentence. If this sentence has a size equal to, or smaller than three words, the sentence
is removed, and the interrupted sentence is joined to the next sentence in line. If the number of
words is greater than three, we consider that the interruption may have enough information to
be considered a second case, and we ignore that sentence. Figure 4.3 shows an example of
this procedure.
Before:
> Friday we went downtown
> Right.
> to dine at that new restaurant.

After:
> Friday we went downtown to dine at that new restaurant.
Figure 4.3: Example of an interruption being identified and resolved. Each > indicates a different sentence. On the left, the second speaker interrupted the first with a comment. The algorithm identifies that the second sentence has three words or fewer, so it removes it and joins the first and third sentences, creating the sentence seen on the right.
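The interruption-handling procedure illustrated in Figure 4.3 can be sketched as follows (a single-pass sketch; the function name is ours, and the three-word threshold is the one defined above):

```python
import re

def resolve_interruptions(lines):
    """Join sentences split by short interruptions (three words or fewer)."""
    result = []
    i = 0
    while i < len(lines):
        current = lines[i]
        # Missing final punctuation signals a possible interruption.
        if not re.search(r'[.!?]$', current) and i + 2 < len(lines):
            interjection = lines[i + 1]
            if len(interjection.split()) <= 3:
                # Drop the interjection and join with the continuation.
                result.append(current + ' ' + lines[i + 2])
                i += 3
                continue
        result.append(current)
        i += 1
    return result
```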
The last step in this module is to normalize the sentences, with a process similar to the one
described in Section 4.2.1. The objective is to capitalize every sentence, add final punctuation
where there is none, and correct errors that may have been created by the removal of content,
such as consecutive commas.
4.4 Sentence Divider
In Section 3.5, we defined two objectives for simplifying phrases: dividing compound sentences
into subsentences; and extracting complementary information from dependent clauses in
complex sentences.
Compound sentences have two independent clauses connected by a coordinating
conjunction, e.g., "Joseph was Christian and Patricia was Buddhist". Part of the division
algorithm is based on [32].
The complementary information to be extracted from complex sentences is:
• Temporal Context: a reference to the time the action takes place (e.g., "We'll play
football next Thursday.");
• Condition: an action necessary to validate a sentence (e.g., "My mother will go to the
beach if it doesn't rain.");
• Cause: the reason that provoked the main event of the sentence (e.g., "I don't eat broccoli
because it stinks!");
• Belief Modifier: statements that may change the truth of the sentence, according to their
context (e.g., "I don't believe that Jack will finish the master's.").
To help identify these speech components, we use the Stanford CoreNLP software [26],
specifically the POS tagger and the dependency parser. Thus, before executing the next steps,
the Stanford Dependency Parser is run on each sentence, producing the necessary POS tags
and dependencies.
Using the cleaned sentences generated in the previous module, each sentence goes through
the following steps. First, the algorithm tries to divide the sentence. Each obtained subsentence
is fed back to the first step, until the sentences cannot be divided any further. Then, the
algorithm searches for the complementary information, reducing each subsentence to its main
components.
4.4.1 Coordinate Sentences
Coordinate sentences are recognizable by a coordinating conjunction connecting both clauses.
The dependency parser labels this conjunction with the POS tag CC, and labels the dependency
as CONJ. However, a coordinating conjunction can also be present in other syntactic elements, like
enumerations (e.g., "John went to the store to buy oranges and bananas").
To divide only the correct sentences, the algorithm searches the tree for a CONJ dependency.
If the dependent is not a verb, the algorithm ignores the dependency and keeps searching
for another CONJ. If the dependent is a verb, the algorithm first checks whether the length of the
possible subsentence is larger than one word; otherwise it would not be a valid sentence. If the
length is larger than one, the algorithm generates both subsentences, dividing at the CC
node (and removing it from each sentence, adding it to the external structure). Finally, the
algorithm checks for the existence of a subject in the second subsentence, labeled by NSUBJ. If
there is none, the NSUBJ from the main sentence is used.
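A simplified sketch of this division step, assuming the parser output is available as aligned token, POS, and dependency lists (the subject-copying step is omitted for brevity, and all names are ours):

```python
def split_coordinate(tokens, deps, pos):
    """Split a compound sentence at a coordinating conjunction.

    tokens: list of words; deps: list of (head_idx, rel, dep_idx) triples;
    pos: Penn Treebank POS tags aligned with tokens.
    Returns (left, right) subsentences, or None when no valid split exists.
    """
    for head, rel, dep in deps:
        # Only conj dependencies whose dependent is a verb mark a clause split.
        if rel == 'conj' and pos[dep].startswith('VB'):
            # Find the coordinating conjunction (CC) between the two clauses.
            cc = next((i for i in range(head + 1, dep) if pos[i] == 'CC'), None)
            if cc is None:
                continue
            left = tokens[:cc]
            right = tokens[cc + 1:]
            if len(right) <= 1:      # not a valid subsentence
                continue
            return ' '.join(left), ' '.join(right)
    return None
```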
4.4.2 Causal Sentences
Causal sentences are identified by a set of causal conjunctions, such as because, so that, in
order to, and variations of those. We compiled all the causal conjunctions into a dictionary. The
algorithm uses a regular expression to search for the causal conjunctions in the sentence. If
the conjunction is found in the middle of the sentence, the clause starting at that conjunction is
extracted and added to the XML structure with the tag <DC>, for direct consequence. If the
conjunction is found at the beginning of the sentence, the algorithm searches for a comma, to
delimit both clauses. If it finds a comma, it retrieves the clause from the conjunction to
the comma. Otherwise, the clause is not retrieved.
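A sketch of this step with regular expressions (the dictionary subset and the function name are illustrative):

```python
import re

# Illustrative subset of the causal-conjunction dictionary.
CAUSAL = ['because', 'so that', 'in order to']

def extract_cause(sentence):
    """Return (simplified_sentence, causal_clause), or (sentence, None)."""
    for conj in CAUSAL:
        if sentence.lower().startswith(conj):
            # Conjunction at the beginning: the clause runs to the first comma.
            m = re.match(re.escape(conj) + r'\s+(.+?),\s*(.+)$',
                         sentence, flags=re.IGNORECASE)
            if m:
                return m.group(2), conj + ' ' + m.group(1)
            continue
        # Conjunction in the middle: the clause runs to the end of the sentence.
        m = re.search(r',?\s+' + re.escape(conj) + r'\s+(.+?)[.!?]?$',
                      sentence, flags=re.IGNORECASE)
        if m:
            return sentence[:m.start()] + '.', conj + ' ' + m.group(1)
    return sentence, None
```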
4.4.3 Temporal Context
The search for a temporal context is done in two steps, considering that it may exist under
different forms.
The first step is to use the dependency tree of the sentence. The dependency parser
labels possible temporal clauses with ADVCL, although this label is used for any kind of clause
modifier. To account for this, the algorithm verifies whether the dependent is one of a set of
temporal conjunctions (i.e., when, after, until, before, since, while, or once) that were compiled
into a dictionary. If the dependent is a temporal conjunction, the subtree with the temporal clause
is retrieved and added to the XML structure with the tag <TC>, for temporal context.
Nevertheless, the group of temporal references that is not mediated by a temporal conjunction
is not recognized as a temporal clause by the parser. To find those references, we use
a dictionary in conjunction with regular expressions that take into account different kinds of
time references (e.g., three years ago, last Saturday, tomorrow at noon, now, everyday, about
a week, etc.). The algorithm uses different combinations of patterns, such as a preposition
followed by a day of the week, or a number followed by a time quantity. If any such temporal
reference is found, it is extracted from the sentence and, again, added to the XML structure. It
is possible to have several temporal references for the same sentence.
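The dictionary-and-pattern search can be sketched as follows (the patterns below are a small illustrative subset, not the full set used):

```python
import re

DAYS = r'(monday|tuesday|wednesday|thursday|friday|saturday|sunday)'
UNITS = r'(second|minute|hour|day|week|month|year)s?'

# Illustrative combinations: preposition + weekday, number + time quantity,
# and bare temporal adverbs.
TEMPORAL_PATTERNS = [
    re.compile(r'\b(next|last|on|every)\s+' + DAYS + r'\b', re.IGNORECASE),
    re.compile(r'\b(a|\d+|one|two|three)\s+' + UNITS + r'\s+ago\b', re.IGNORECASE),
    re.compile(r'\b(yesterday|today|tomorrow|now|everyday)\b', re.IGNORECASE),
]

def extract_temporal(sentence):
    """Remove temporal references from a sentence; return (sentence, refs)."""
    refs = []
    for pattern in TEMPORAL_PATTERNS:
        for m in pattern.finditer(sentence):
            refs.append(m.group(0))
        sentence = pattern.sub('', sentence)
    # Tidy whitespace and spaces left before punctuation.
    sentence = re.sub(r'\s{2,}', ' ', sentence)
    sentence = re.sub(r'\s+([,.!?])', r'\1', sentence)
    return sentence.strip(), refs
```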
4.4.4 Condition
A condition is defined by the existence of an if word, followed by a dependent clause. To
retrieve the conditional clause, the dependency tree is used to properly identify the limits of the
condition. The algorithm searches the dependency graph for a MARK dependency, and checks
whether the dependent word is if, for the same reasons expressed in the previous step: the MARK
label is used to identify several types of finite subordinate clauses, not only conditional
clauses. If a condition is found, it is extracted from the sentence and added to the structure
under the tag <CC>.
4.4.5 Belief Modifier
Finally, the last step searches for belief modifiers, again using a dictionary in association with a
regular expression. The algorithm searches for a pattern that matches a pronoun (e.g., he,
they, etc.) followed by a conjugation of a specific verb, taken from a dictionary. The dictionary
was compiled by finding all the occurrences of a belief modifier in the dataset, and retrieving
the verb and its conjugations. It contains verbs such as think, know, say, find, and decide. It is
likely that these verbs are used in cases other than belief modifiers, therefore the algorithm
applies restrictions to the search. Among others, the modifier must be at the beginning of the
sentence, and there must be at least one other verb present.
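A sketch of this search (the verb dictionary subset is illustrative, and the "at least one other verb" restriction is approximated here by requiring a minimal remainder):

```python
import re

PRONOUNS = r'(i|you|he|she|we|they)'
# Illustrative subset of the belief-verb dictionary with conjugations.
BELIEF_VERBS = r'(think|thinks|thought|know|knows|knew|say|says|said|find|found|decide|decided)'

# Restriction: the modifier must be at the beginning of the sentence.
BELIEF_RE = re.compile(r'^' + PRONOUNS + r'\s+' + BELIEF_VERBS + r'\s+(that\s+)?',
                       flags=re.IGNORECASE)

def extract_belief_modifier(sentence):
    """Strip a sentence-initial belief modifier; return (sentence, modifier)."""
    m = BELIEF_RE.match(sentence)
    if not m:
        return sentence, None
    rest = sentence[m.end():]
    # Approximation of the "another verb present" restriction:
    # require at least two remaining words.
    if len(rest.split()) < 2:
        return sentence, None
    modifier = sentence[:m.end()].strip()
    return rest[0].upper() + rest[1:], modifier
```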
4.5 Relation Extractor
The last module in our system is responsible for extracting the relations. After each sentence
goes through the previous steps, the application prepares every subsentence generated (or
just the original sentence, if it was not simplified) and feeds them into each of the Open IE
tools, namely ReVerb, OLLIE, OpenIE 4, and Stanford OIE. Each application is configured
to generate as many extractions as it can find for a given sentence, along with the confidence
value. The extraction is retrieved in the form confidence - (argument 1; relation
phrase; argument 2). For each obtained extraction, an <extraction> element is inserted in
the XML structure along with its confidence value, and every element of the extraction populates
its respective field in the structure as well. Later, for evaluation purposes, every extraction
is manually classified as 1 if correct, or 0 if incorrect, according to the criteria defined in Section
3.2.
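Parsing the extractor output format described above into the fields stored in the XML structure can be sketched as:

```python
import re

# One line per extraction: "confidence - (arg1; rel; arg2)".
EXTRACTION_RE = re.compile(
    r'^(?P<confidence>[01]\.\d+)\s*-\s*'
    r'\((?P<arg1>[^;]*);(?P<rel>[^;]*);(?P<arg2>[^)]*)\)$')

def parse_extraction(line):
    """Parse one extractor output line into a dict, or None on mismatch."""
    m = EXTRACTION_RE.match(line.strip())
    if not m:
        return None
    return {'confidence': float(m.group('confidence')),
            'arg1': m.group('arg1').strip(),
            'rel': m.group('rel').strip(),
            'arg2': m.group('arg2').strip()}
```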
Chapter 5
Evaluation
In this chapter, we analyze the results obtained by our implementation. Section 5.1 presents
a detailed analysis of the values obtained, in comparison with the initial values. In Section 5.2
we evaluate the problems generated by our system. Finally, in Section 5.3 we discuss and
conclude the study presented in this work.
5.1 Results
This section presents the results that were obtained after processing the set of 500 sentences
with our system. Table 5.1 compares the total number of extractions, the precision, recall, and
F1-score of every extractor when running on our dataset against the values obtained for the
ground truth.
An initial observation shows that, first, the number of extractions decreased, and second,
the values improved for every extractor. The fact that the extractors are processing smaller
quantities of information in each sentence justifies the first observation. In Chapter 3 we
concluded that the extractors produced more errors with longer sentences and more noise. Our
application focused on simplifying the data, preserving the information essential to the relation
and storing complementary information outside the sentence. It follows directly that there is
less information to extract relations from, which accounts for fewer extractions. In our case, that
consequence may be positive, if the majority of the extractions removed were incorrect. The
second observation supports that idea, as every extractor improved its results.
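From the manual binary classifications, the metrics reported here can be computed as follows (a sketch; the thesis does not specify the recall denominator, so num_gold_relations is an assumption about the evaluation setup):

```python
def precision_recall_f1(classifications, num_gold_relations):
    """Compute precision, recall, and F1-score.

    classifications: binary labels (1 = correct) for every extraction produced.
    num_gold_relations: number of correct relations annotated in the dataset
    (assumed denominator for recall).
    """
    correct = sum(classifications)
    precision = correct / len(classifications) if classifications else 0.0
    recall = correct / num_gold_relations if num_gold_relations else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```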
ReVerb demonstrated the smallest improvement, with an increase of 0.02 in precision, 0.06
in recall, and 0.06 in F1-score. The heuristics used by ReVerb (described in Section 2.4) are
limited to a pattern, which restricts the range of possible relations it can extract. It was possible
to observe that reducing noise and complexity had an impact, but it was small compared with
                  Extractions      Precision       Recall        F1-score
                 Before  After   Before  After  Before  After  Before  After
ReVerb              760    545     0.36   0.38    0.12   0.18    0.18   0.24
OLLIE              1030    805     0.37   0.44    0.17   0.25    0.23   0.32
Stanford OIE       1415   1261     0.27   0.39    0.24   0.35    0.25   0.37
OpenIE 4           1190    975     0.42   0.51    0.15   0.26    0.22   0.34

Table 5.1: Number of extractions, precision, recall, and F1-score obtained for the ground truth (Before) and for processed dialogue (After).
the other three tools.
OLLIE exhibits an improvement of 0.07 in precision, 0.08 in recall, and 0.09 in F1-score.
OLLIE reveals balanced results overall, with the second best precision, and recall and F1-scores
close to those of OpenIE 4. OLLIE is also based on patterns, although they are learned instead
of defined manually. Our study showed that noise and complexity interfere with the recognition
of those patterns. Nevertheless, as the patterns are learned, it is possible that, if OLLIE is trained
with a bootstrap of seeds from dialogue text, it can adapt to better recognize dialogue patterns.
Stanford OIE shows an increase of 0.12 points in precision, 0.11 in recall, and 0.12 in
F1-score. Despite recall being high in the ground truth, compared with the other extractors,
it still improved significantly. As for precision, in our baseline we observed that, being the
tool that generates the most extractions, it consequently produced a large number of incorrect
extractions, having the lowest precision of the four. By using our system, Stanford OIE increased
its precision to 0.39, an increase of 44%. Stanford OIE searches for extractions by
applying its own method of simplification, attempting to divide the sentences until it recognizes
a relation. Sentence simplification is a process prone to errors. This study showed that a
focus on reducing noise and complexity can have a considerable impact on such a process.
OpenIE 4 shows a growth of 0.09 points in precision, 0.11 in recall, and 0.12 in F1-score. It
is the extractor with the highest precision, 0.51. OpenIE 4's main innovation was the extraction of
n-ary relations. The extractor works with the opposite idea of Stanford OIE, as it uses semantic
role labeling (SRL) to retrieve as much information as possible, using concepts similar to ours
(e.g., time context), instead of simplifying. Our application is not entirely complementary to
this process, as it replaces part of OpenIE 4's functions. This situation was observed in the
extractions, as most relations followed the traditional format, and there was only a small number
of n-ary extractions. Nevertheless, the use of SRL to extract relations proves to be effective
compared with the other extractors, and the reduction of noise and complexity in the input
sentences is able to improve that process.
[Figure 5.1 contains four precision-recall plots, one per extractor: ReVerb (Raw Dialogue AUC = 0.081, Processed Dialogue AUC = 0.092); OLLIE (Raw 0.154, Processed 0.189); Stanford OIE (Raw 0.152, Processed 0.192); OpenIE 4 (Raw 0.164, Processed 0.205).]
Figure 5.1: Comparison between the precision-recall curves of each extractor, measured for the ground truth (blue) and for text processed by our application (red).
Figure 5.1 shows the differences between the precision-recall curves of the ground truth
results and the results generated by applying our system. Every extractor shows overall
improvements. ReVerb has the smallest change, increasing only the number of correct extractions
for the top values of confidence. OLLIE demonstrates a larger improvement in precision,
especially for medium values of confidence (in the range of 0.4 to 0.7). Stanford OIE
reduces the local minimum visible in the ground truth and has a smoother curve, improving
the precision for medium values of confidence. OpenIE 4 shows the best results, confirmed
by its AUC, and presents an increase at all levels of confidence. In Figure 5.2 we can see a
comparison of the precision-recall curves for every extractor side-by-side.
[Figure 5.2 contains a single precision-recall plot with all four curves: ReVerb AUC = 0.092; OLLIE AUC = 0.189; Stanford OIE AUC = 0.192; OpenIE 4 AUC = 0.205.]
Figure 5.2: Side-by-side comparison of all four precision-recall curves, measured for the dialogue processed by our application.
5.2 Problems
In order to analyze the reliability of our system, i.e., whether the modules implemented were
successful in removing noise, and in simplifying and dividing sentences, we manually annotated
the errors found in the XML structures, and measured their frequency. We define these errors
as situations where the system did not produce the desired output, e.g., it did not divide a
sentence where it was supposed to, or it removed information wrongly identified as noise.
Table 5.2 summarizes those results.
The problems encountered with noise removal were mostly related to interruptions, in
situations where the algorithm incorrectly classified interruptions, and situations where it failed
to properly join the sentences.
The division of sentences generated errors in situations where the algorithm should have
been able to divide a sentence but failed to do so.
As for the retrieval of complementary information, the majority of problems occurred while
trying to identify the temporal context. Almost every error was due to a fault in the regular
expressions used, as they were not capable of identifying a large variety of temporal phrases.
Error                                      Frequency

Noise
  Repetitions                                   2%
  Meaningless Expressions                       1%
  Interruptions                                 7%
Sentence Division                              15%
Complementary Information
  Temporal Context                             32%
  Condition                                     0%
  Cause                                         9%
  Belief Modifier                               3%

Table 5.2: Frequency of errors encountered in the XML structures, for each task.
5.3 Discussion
Compared to our benchmark, our system improved every result for each extractor. The first
conclusion is that the problems identified, namely the noise and the size of the sentences, have
a real impact on the identification of relations, and imply that extended studies on these topics
may further improve the results.
As for the comparative study, ReVerb proved to be weaker than the other three extractors,
not justifying added efforts in its improvement.
OLLIE was not the best extractor in either metric, but demonstrated comparatively balanced
results. Due to its learning component, it is possible that if OLLIE is trained with a
bootstrap of relation seeds taken from dialogue transcriptions, its results can improve
significantly.
Stanford OIE and OpenIE 4 presented the best results in recall and precision, respectively,
with Stanford OIE achieving the best F1-score. Although their results are low compared to the
results reported for structured text, they showed a significant improvement over the ground truth.
As a final conclusion, with this study we demonstrated that it is possible to improve the
extraction of relations from dialogue transcriptions, and we recommend further studies in this
field, especially using the extractors Stanford OIE and OpenIE 4.
Chapter 6
Conclusions
Open Information Extraction (Open IE) is becoming an important process in information
retrieval and human text processing. Therefore, it is important to extend the scope of such
tools to spoken dialogue transcriptions.
As far as we know, our work provides one of the first studies of the application of Open IE
to dialogue texts. We were able to understand the limitations of Open IE tools applied to this
type of text, and provided a first attempt at a framework capable of resolving the problems that
jeopardize the generation of good results. By removing noise from the text, decreasing the size of
sentences by dividing them, and identifying complementary information to the relation that
can be stored outside the sentence, we were able to improve precision, recall, and F1-score
relative to our ground truth. Although these results are not yet ready to be useful in
real-world applications, the improvements demonstrate that there is potential in pursuing further
studies on this topic.
We hope that this work provides a reference study to the application of Open IE to spoken
dialogue transcriptions.
6.1 Contributions
The main contributions of the research reported in this dissertation are:
• A comparative and detailed study of four Open IE tools, namely ReVerb, OLLIE, Stanford
OIE and OpenIE 4, when applied to dialogue text;
• An application to be used together with an Open IE tool, which improved the quality of
extractions, showing an improvement in precision of up to 11 percentage points (pps),
in recall of up to 12 pps, and in F1-score of up to 11 pps for the best obtained results;
• The definition of a structure capable of storing complementary information to the relation,
in order to improve the extraction paradigm.
6.2 Future Work
As future work, we propose that the following points should be addressed:
• Creating a larger labeled dialogue dataset, which can be crucial to the improvement of our
system;
• Extending our framework to include types of sentences not considered in this study,
namely questions;
• Improving sentence division by taking into account more types of subordinate clauses;
• Improving information removal by addressing more types of clauses, e.g., location.
Bibliography
[1] F. Wu and D. S. Weld. Open information extraction using Wikipedia. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, pages 118–127, 2010.
[2] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the
web. Communications of the ACM, pages 68–74, 2008.
[3] T. Lin, Mausam, and O. Etzioni. Identifying functional relations in web text. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, pages 1266–
1276, 2010.
[4] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 1535–1545, 2011.
[5] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning
for information extraction. In Proceedings of the Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning, pages
523–534, 2012.
[6] S. Soderland, J. Gilmer, R. Bart, O. Etzioni, and D. Weld. Open information extraction to
KBP relations in 3 hours. In Text Analysis Conference - Knowledge Base Population, 2013.
[7] G. Angeli, M. J. Premkumar, and C. D. Manning. Leveraging linguistic structure for open
domain information extraction. In Proceedings of the Annual Meeting of the Association of
Computational Linguistics, 2015.
[8] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner:
Open information extraction on the web. In Proceedings of the Annual Conference of
the North American Chapter of the Association for Computational Linguistics: Demonstrations,
pages 25–26, 2007.
[9] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice-Hall, 2009.
[10] S. Petrov, D. Das, and R. T. McDonald. A universal part-of-speech tagset. CoRR, 2011.
[11] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics, pages 313–330, 1993.
[12] F. Karlsson, A. Voutilainen, J. Heikkila, and A. Anttila. Constraint Grammar: A Language-
independent System for Parsing Unrestricted Text, pages 165–284. Mouton de Gruyter,
1995.
[13] S. Abney. Part-of-speech tagging and partial parsing. In Corpus-Based Methods in Lan-
guage and Speech, pages 118–136, 1996.
[14] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python, chapter 7,
pages 264–277. O’Reilly Media, Inc., 2009.
[15] P. G. Otero. The meaning of syntactic dependencies. Linguistik online, 2008.
[16] M.-C. d. Marneffe and C. D. Manning. Stanford Typed Dependencies Manual. Stanford
University, 2015.
[17] J. Nivre. Dependency grammar and dependency parsing. Technical report, Vaxjo Univer-
sity, 2005.
[18] S. Kubler, R. McDonald, and J. Nivre. Dependency Parsing. Morgan and Claypool, 2009.
[19] L. Marquez, X. Carreras, K. C. Litkowski, and S. Stevenson. Semantic role labeling: An
introduction to the special issue. Computational Linguistics, pages 145–159, 2008.
[20] D. Gildea and D. Jurafsky. Automatic labeling of semantic roles. Computational Linguistics,
pages 245–288, 2002.
[21] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings
of the Annual Meeting of the Association for Computational Linguistics and of the
International Conference on Computational Linguistics, 1998.
[22] M. Ciaramita and M. Surdeanu. DeSRL: A linear-time semantic role labeling system. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing,
2008.
[23] K. K. Schuler. VerbNet: A Broad-coverage, Comprehensive Verb Lexicon. PhD thesis,
Philadelphia, PA, USA, 2005. AAI3179808.
[24] A. Bjorkelund, L. Hafdell, and P. Nugues. Multilingual semantic role labeling. In Proceed-
ings of the Conference on Computational Natural Language Learning: Shared Task, pages
43–48, 2009.
[25] R. Morante, V. V. Asch, and A. V. D. Bosch. Dependency parsing and semantic role labeling
as a single task. Proceedings of the Conference on Computational Natural Language
Learning: Shared Task, 2009.
[26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The
Stanford CoreNLP natural language processing toolkit. In Association for Computational
Linguistics System Demonstrations, pages 55–60, 2014.
[27] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 740–750, Oct. 2014.
[28] J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald,
S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. Universal dependencies
v1: A multilingual treebank collection. In Proceedings of the International Conference on
Language Resources and Evaluation (LREC 2016), May 2016.
[29] J. van Benthem. A brief history of natural logic. Technical Report PP-2008-05, University
of Amsterdam, 2008.
[30] D. Jurafsky, E. Shriberg, and D. Biasca. Switchboard SWBD-DAMSL shallow-discourse-
function annotation coders manual, draft 13, 1997.
[31] K. Boyd, K. H. Eng, and C. D. P. Jr. Area under the precision-recall curve: Point estimates
and confidence intervals. In ECML/PKDD (3), volume 8190 of Lecture Notes in Computer
Science, pages 451–466. Springer, 2013.
[32] J. C. Collados. Splitting complex sentences for natural language processing applications:
Building a simplified Spanish corpus. Procedia - Social and Behavioral Sciences, 95:464–472,
2013. ISSN 1877-0428.
[33] M. Bronzi, Z. Guo, F. Mesquita, D. Barbosa, and P. Merialdo. Automatic evaluation of rela-
tion extraction systems on large-scale. In Proceedings of the Joint Workshop on Automatic
Knowledge Base Construction and Web-scale Knowledge Extraction, pages 19–24, 2012.
[34] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld,
and A. Yates. Unsupervised named-entity extraction from the web: An experimental study.
In Artificial Intelligence, pages 91–134, 2005.
[35] J. Schmidek and D. Barbosa. Improving open relation extraction via sentence re-structuring.
In Proceedings of the International Conference on Language Resources and
Evaluation, 2014.
[36] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam. Open information
extraction: The second generation. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 3–10, 2011.
[37] M. Recasens, M.-C. de Marneffe, and C. Potts. The life and death of discourse enti-
ties: Identifying singleton mentions. In Proceedings of NAACL-HLT 2013, pages 627–633,
2013.
[38] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford’s
multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceed-
ings of the Conference on Computational Natural Language Learning: Shared Task, pages
28–34, 2011.
[39] E. Bengtson and D. Roth. Understanding the value of features for coreference resolution.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 294–303, 2008.
[40] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and
C. Manning. A multi-pass sieve for coreference resolution. In Proceedings of the Confer-
ence on Empirical Methods in Natural Language Processing, pages 492–501, 2010.
[41] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution.
In Proceedings of the Annual Meeting on Association for Computational Linguistics, 2002.
[42] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference
resolution of noun phrases. Computational Linguistics, pages 521–544, 2001.
[43] P. Kingsbury and M. Palmer. From treebank to propbank. In Language Resources and
Evaluation, 2002.
[44] S. Soderland. Learning information extraction rules for semi-structured and free text. Ma-
chine Learning, pages 233–272, 1999.
[45] E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings
of the National Conference on Artificial Intelligence, pages 1044–1049, 1996.