Open Information Extraction from Dialogue Transcriptions
José Jorge Marcos Raposo
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Doctor David Manuel Martins de Matos and Doctor Bruno Emanuel da Graça Martins
Examination Committee
Chairperson: Doctor José Carlos Alves Pereira Monteiro
Supervisor: Doctor David Manuel Martins de Matos
Member of the Committee: Doctor Pável Pereira Calado
October 2016
Acknowledgments
To my supervisors, thank you for the support, the guidance, and the help in every way possible
along this entire journey.
To my parents, and grandparents, for their unconditional love, understanding, support, and
patience, without whom I would not be able to be here today.
To my brother Diogo, for the motivation, the help, the advice, the last minute solutions, and
for showing me the way.
To Pedro, for the amazing support and motivation, even from far away.
To Carolina, for the caring and patience, and for helping me grow in every moment of this
adventure.
To Fredy, Simões, Rita, João and Luís, for never giving up on me and pushing me to the
end.
To Rafa, Diogo, Rute, Ed and Francisco, for the tremendous help at the worst time.
To Sara and Mafalda, for all the companionship and encouragement.
To Catarina, Miguel, Lara, Catarina, Mário, Carlos, Ana, Berta, Christelle, Tiago, Filipe,
Anouk, Sofia, Catarina, Adelino, Ana, António, Lipinho, John, Vanessa, Igor, Afonso, Cristiana,
Iryna and Isabel, for that motivation at the right moment that made all the difference.
Thank You!
Resumo
A Extração Aberta de Informação consiste na tarefa de encontrar uma representação estruturada para as relações e declarações presentes em texto de língua natural, sem usar uma categorização predefinida dos tipos de relações que irão ser extraídos. Ferramentas de extração aberta de informação exploram informação lexical, juntamente com informação sintática e/ou semântica de frases, para nelas procurar relações. Aplicar métodos de extração de informação a transcrições de diálogo, um tipo específico de texto que é menos estruturado e consistente que texto formal, resulta numa redução de desempenho. Para resolver este problema, em primeiro lugar, vamos apresentar um estudo comparativo entre quatro sistemas de extração aberta de informação, nomeadamente os sistemas ReVerb, OLLIE, Stanford OIE e OpenIE 4, com o objetivo de analisar e estudar essa diferença de resultados. Em segundo lugar, iremos implementar uma aplicação que, juntamente com as ferramentas de extração mencionadas, resulta num aumento da precisão em 12 pontos percentuais (pps), da sensibilidade em 11 pps, e do F1-score em 11 pps, para os melhores resultados obtidos, o que melhorou a qualidade das extrações. O nosso sistema pré-processa texto de diálogo antes de este ser passado às ferramentas de extração, simplificando e dividindo frases usando várias técnicas de processamento de língua natural.
Palavras-chave: Extração Aberta de Informação, Stanford OpenIE, OpenIE 4, OLLIE, ReVerb, Diálogo em Língua Natural.
Abstract
Open Information Extraction (Open IE) is the task of finding a structured representation for the relations and assertions present in natural language text, without using a predefined categorization of the relation types that are to be extracted. Open IE tools exploit word tokens, together with syntactic and/or semantic information from sentences, to search for relations in them. Applying Open IE methods to dialogue transcriptions, a specific kind of text that is less structured and consistent than formal text, results in a significant decrease in performance. To address this issue, we first present a comparative study between four Open IE systems, namely ReVerb, OLLIE, Stanford OIE, and OpenIE 4, intended to analyze and justify the difference in results. After these initial tests, we implemented an application that, used together with the aforementioned Open IE tools, increases precision by 11 percentage points (pps), recall by up to 12 pps, and F1-score by 11 pps for the best results obtained, improving the quality of the extractions. Our system pre-processes dialogue text before it is passed to the extraction tools, simplifying and dividing the sentences using several natural language processing techniques.
Keywords: Open Information Extraction, Stanford OpenIE, OpenIE 4, OLLIE, ReVerb, Dialogue in Natural Language.
Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
Glossary
1 Introduction
  1.1 Topic Overview
  1.2 Objectives and Contributions
  1.3 Thesis Outline
2 Background
  2.1 Open Information Extraction
  2.2 Natural Language Processing
    2.2.1 Tokenisation
    2.2.2 Part-of-speech Tagging
    2.2.3 Noun Phrase Chunking
    2.2.4 Dependency Parsing
    2.2.5 Semantic Role Labeling
  2.3 Stanford CoreNLP
  2.4 Open Information Extraction Tools
    2.4.1 ReVerb
    2.4.2 OLLIE
    2.4.3 OpenIE 4
    2.4.4 Stanford Open Information Extraction
3 Introductory Analysis
  3.1 Dataset Study
  3.2 Evaluation Method
  3.3 Open IE: A Comparative Study
    3.3.1 Reported Results
    3.3.2 Benchmarks
  3.4 Dataset Problems
  3.5 Strategy Definition
4 Implementation
  4.1 Overview
  4.2 Input Requisites and Output
    4.2.1 Input Requisites
    4.2.2 Output
  4.3 Sentence Cleaner
  4.4 Sentence Divider
    4.4.1 Coordinate Sentences
    4.4.2 Causal Sentences
    4.4.3 Temporal Context
    4.4.4 Condition
    4.4.5 Belief Modifier
  4.5 Relation Extractor
5 Evaluation
  5.1 Results
  5.2 Problems
  5.3 Discussion
6 Conclusions
  6.1 Contributions
  6.2 Future Work
Bibliography
List of Tables

2.1 Relation phrase corresponding to the sentence "The dinosaurs became extinct."
2.2 POS Tagging of Larry Niven's quote "The dinosaurs became extinct because they didn't have a space program", using the Penn TreeBank tagset. DT: Determiner; NNS: Noun, plural; VBD: Verb, past tense; JJ: Adjective; IN: Preposition; PRP: Personal pronoun; RB: Adverb; VB: Verb, base form; NN: Noun, singular.
2.3 Pattern used by ReVerb to search for relations. Every relation must be matched with one of the rules. ReVerb searches for the longest match it can find, i.e., it gives priority to the rule VP over the rule V.
3.1 Precision and recall values reported at the TAC KBP Slot Filling Challenge 2013.
3.2 Initial results obtained when testing the four extractors with raw dialogue.
3.3 Frequency of different kinds of noise in the incorrect extractions with noise-related problems.
3.4 Frequency and examples of different kinds of subordinations in incorrect extractions originating from long sentences. Conjunctions are marked in bold.
5.1 Number of extractions, precision, recall, and F1-score obtained for the ground truth, and for processed dialogue. The best result in each section is marked in bold.
5.2 Frequency of errors encountered in the XML structures, for each task.
List of Figures

2.1 Dependency parse for the sentence The dinosaurs became extinct because they didn't have a space program. Words are labeled with the respective POS tag and arcs are labeled with the name of the relation. det: Determiner; nsubj: Nominal Subject; xcomp: Open clausal complement; mark: Marker; aux: Auxiliary; neg: Negation Modifier; advcl: Adverbial clause modifier; dobj: Direct object; compound or nn: Noun compound modifier.
2.2 Semantic role labelling of the sentence The dinosaurs became extinct because they didn't have a space program. Words are marked with their respective POS tag, and arcs point to the grammatical function the node has to the head. SBJ: Subject; NMOD: Nominal modification; PRD: Second predicate; PRP: Purposer; SUB: Subordinate; ADV: Adverbial modification; VC: Verb Complement; OBJ: Object.
3.1 Example of an extraction taken from the sentence "I have two dogs", with its confidence value. The first and second arguments are, respectively, I and two dogs, and the relation phrase is have.
3.2 Precision-recall curve for every extractor, along with the Area Under the Curve (AUC).
4.1 Architecture of our application. The three main modules process the data in sequence, while the Output Structure Organizer module works in parallel with them.
4.2 Example of the output XML structure for the sentence "Last year I though it was a good sacrifice to make, because I wanted to spend some time with my kids."
4.3 Example of an interruption being identified and resolved. Each > indicates a different sentence. On the left, the second speaker interrupted the first with a comment. The algorithm identifies that the sentence has fewer than three words, so it removes it and joins the first and third sentences, creating the sentence seen on the right.
5.1 Comparison between the precision-recall curves of each extractor, measured for the ground truth (blue) and for text processed by our application (red).
5.2 Side-by-side comparison of all four precision-recall curves, measured for the dialogue processed by our application.
Chapter 1
Introduction
Open Information Extraction (Open IE) is the task of finding a structured representation for the relations and assertions present in natural language text, without using a predefined categorization of the relations that are to be extracted. Open IE has several purposes, e.g., to extract factual information from a given source [1, 2], or to acquire common sense knowledge [3].
The existing tools show good results when applied to grammatically well-formed text, such as news articles or Wikipedia articles. However, when faced with less structured text, such as dialogue transcriptions, their performance degrades. The grammatical quality of dialogue texts varies significantly, and dialogue is usually prone to interruptions, repetitions and mid-sentence corrections, noise, errors, and other problems.
These problems limit the usefulness of such tools, which is unfortunate because dialogue is a kind of natural language that is present everywhere: on a daily basis, it is the favored way for people to communicate with each other. Hence, the information present in dialogues of many different kinds (e.g., transcriptions of public speeches, or movie subtitles) can be important for several types of applications, such as:
• Extracting information from dialogues in surveillance systems;
• Applications for real-time validation of topics in a group conversation;
• Automatic retention of relevant information in debates;
• Among numerous others.
Current Open IE tools are not prepared to extract useful information from a dialogue. Therefore, the motivation of the research presented in this dissertation was to improve the way that Open IE applications process human dialogue.
1.1 Topic Overview
In this project, we start by studying some of the current Open IE tools, specifically ReVerb [4], OLLIE [5], OpenIE 4 [6], and the Stanford Open IE system (Stanford OIE) [7]. We examine how these tools process dialogue, in comparison to more structured text. Considering that each of these tools uses a different approach, the objective is to understand which one shows the best results for the different kinds of dialogue and their problems.
Secondly, and without changing the algorithms behind those tools, different processing techniques will be applied to the texts in sequential layers, to improve the quality of the data, and the subsequent quality and number of extracted relations.
1.2 Objectives and Contributions
The three main objectives of this thesis were:
1. To provide a comparative evaluation of four Open IE tools when applied to dialogue;
2. To implement a system that improves the quality of relations extracted from dialogue
transcriptions, used in conjunction with Open IE tools;
3. To propose an improved format for the structure of the information extracted.
1.3 Thesis Outline
This document is organized as follows: Chapter 2 introduces the concept of Open Information
Extraction and provides some background into fundamental concepts from the area of Natural
Language Processing. Chapter 3 provides a comparative study between the four extractors,
when applied to dialogue text. This chapter also analyses the dataset that will be used in the
evaluation. Chapter 4 describes the implementation of a system that will attempt to improve the
results of each Open IE tool. Chapter 5 presents the experimental evaluation of our application. Finally, Chapter 6 presents the conclusions of this dissertation and highlights possible paths for future work.
Chapter 2
Background
This chapter presents a summary of the fundamental concepts necessary to understand the following chapters. Section 2.1 introduces the concept of Open IE. Section 2.2 addresses several techniques used in natural language processing. Sections 2.3 and 2.4 introduce the NLP toolkit and the Open IE tools that will be used in this work.
2.1 Open Information Extraction
Open Information Extraction (Open IE) is the task of automatically extracting structured information from text, in the form of relation tuples (a tuple is a finite ordered list of elements). A relation phrase is a phrase expressing a relation between two arguments. In the example shown in Table 2.1, the first argument is The dinosaurs and the second is extinct, while the relation phrase is became.
The dinosaurs became extinct
Argument 1 Relation Phrase Argument 2
Table 2.1: Relation phrase corresponding to the sentence "The dinosaurs became extinct."
Open IE differs from traditional IE by being domain independent: the theme of the text is not known in advance. Traditional IE systems target very specific extractions, using hand-crafted extraction rules, or rules learned by domain-specific supervised methods. Open IE is instead based on creating the extraction rules automatically, making them independent from the context or domain. An Open IE system takes as input a corpus of raw text, ideally makes a single pass over it, and outputs a set of extracted relations [2, 8].
2.2 Natural Language Processing
Natural Language Processing (NLP) is the field concerned with the interaction between humans and computers at the level of human language. Open IE is a subtask of this field, and takes advantage of several of its procedures. The following subsections introduce the most important techniques used by Open IE tools, and by this work.
2.2.1 Tokenisation
Tokenisation is the task of segmenting text into parts, such as words, sentences, or units in between. The process of tokenisation is dependent on the text's language. Usually, punctuation has to be taken into account, and more often than not the segmentation rules depend on the context (for instance, apostrophes are used for more than one purpose). Tokenisation can be achieved using regular expressions, which are robust although somewhat strict, or through machine learning methods, which are more general [9].
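The regular-expression approach can be sketched in a few lines. The rule set below is a minimal, illustrative example (not the tokeniser of any tool discussed later); note that, unlike the Penn TreeBank convention used elsewhere in this chapter, it keeps contractions such as didn't as a single token.

```python
import re

# Illustrative regex tokeniser. Alternatives are ordered so that longer
# units (contractions, ellipses) are matched before single characters.
TOKEN_PATTERN = re.compile(r"""
    \w+(?:'\w+)?   # words, keeping simple contractions together
  | \.\.\.         # ellipsis
  | [.,!?;:"()]    # individual punctuation marks
""", re.VERBOSE)

def tokenise(text):
    """Return the list of tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("The dinosaurs became extinct, didn't they?"))
# → ['The', 'dinosaurs', 'became', 'extinct', ',', "didn't", 'they', '?']
```

A machine-learning tokeniser would instead learn where token boundaries fall from annotated data, which handles ambiguous punctuation more gracefully.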
2.2.2 Part-of-speech Tagging
Part-of-Speech (POS) tagging is the process of assigning a part-of-speech category to each token in a corpus. Parts-of-speech are classes of words, based on their function in a text and on the similarities in their behaviour. For instance, in a simple tagset with 12 classes (i.e., a closed group of possible tags), tokens can be classified as nouns, verbs, adjectives, adverbs, pronouns, determiners or articles, prepositions, numerals, conjunctions, particles, and punctuation marks, as described in [10].
For more complex studies, and for a better understanding and classification of text, larger tagsets can also be used, like the Penn TreeBank tagset, which considers 45 classes [11], or the 61-class C5 tagset¹. Table 2.2 shows an example of a classified sentence.
The dinosaurs became extinct because they did n’t have a space program
DT NNS VBD JJ IN PRP VBD RB VB DT NN NN
Table 2.2: POS Tagging of Larry Niven's quote "The dinosaurs became extinct because they didn't have a space program", using the Penn TreeBank tagset. DT: Determiner; NNS: Noun, plural; VBD: Verb, past tense; JJ: Adjective; IN: Preposition; PRP: Personal pronoun; RB: Adverb; VB: Verb, base form; NN: Noun, singular.
To tag a corpus, there are several algorithms available, based on different methods. Modern approaches mainly leverage rule-based or probabilistic methods. In both cases, the input is a string of words and a tagset, and the output is a single best tag for each word.

¹ http://www.natcorp.ox.ac.uk/docs/c5spec.html
Rule-based methods generally use a two-stage architecture. In the first stage, a tagger assigns all possible parts-of-speech to each word. Then, the tagger applies a large set of constraints to the input sentence to filter out the incorrect parts-of-speech, returning a single POS entry for each word. A good implementation of this approach, using the EngCG² tagger, is described in [12].
Probabilistic methods, instead of using heuristics, select a tag for a word based on probabilities assigned to each tag. These probabilities are computed with a model built through supervised learning. A well-known algorithm based on probabilistic methods is the Hidden Markov Model (HMM) POS tagger. This algorithm first builds all possible tag sequences for a given sentence, and then chooses the sequence with the highest probability. The probability of each tag is computed by multiplying two terms: the probability of a certain tag occurring after another (i.e., a bigram sequence), and the probability of a word being associated with a given tag. Both probabilities are estimated by counting the relevant occurrences in a labeled corpus. The final part of the algorithm, typically named decoding (i.e., choosing the best tag sequence), is done using a dynamic programming algorithm such as Viterbi's [13, 9].
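The decoding step can be sketched as follows. The transition, emission, and start probabilities below form a hypothetical toy model, not estimates from a real corpus; a real tagger would obtain them by counting occurrences in labeled data, exactly as described above.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Toy HMM decoder: returns the most probable tag sequence.
    trans[(t1, t2)]: P(t2 | t1); emit[(t, w)]: P(w | t); start[t]: P(t first).
    Unseen events get a tiny floor probability instead of zero."""
    # best[t] holds (log probability, tag sequence) of the best path ending in t
    best = {t: (math.log(start.get(t, 1e-12)) +
                math.log(emit.get((t, words[0]), 1e-12)), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # pick the predecessor tag that maximises the path score
            p, seq = max(
                (best[prev][0] + math.log(trans.get((prev, t), 1e-12)), best[prev][1])
                for prev in tags)
            new[t] = (p + math.log(emit.get((t, w), 1e-12)), seq + [t])
        best = new
    return max(best.values())[1]

# Hypothetical toy model over three tags.
tags = ["DT", "NN", "VBD"]
start = {"DT": 0.8, "NN": 0.1, "VBD": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VBD"): 0.8, ("VBD", "DT"): 0.5,
         ("NN", "NN"): 0.1, ("DT", "DT"): 0.05}
emit = {("DT", "the"): 0.7, ("NN", "dinosaurs"): 0.4, ("VBD", "became"): 0.3}

print(viterbi(["the", "dinosaurs", "became"], tags, trans, emit, start))
# → ['DT', 'NN', 'VBD']
```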
2.2.3 Noun Phrase Chunking
Chunking, often referred to as shallow parsing, is the technique used to segment and label the tokens of a sentence into subconstituents, such as noun or verb phrases.

Chunking a sentence requires the tokens to first be POS tagged, because the POS labels are used in the patterns for recognizing phrases. Usually, a text chunking process starts by tokenising the sentence, then tags it with a POS tagset, and finally chunks it. There are two main methods to chunk a sentence: using regular expressions, and training chunk parsers.
The first method uses a grammar based on regular expressions, which essentially defines rules using POS tag patterns. For instance, the pattern <DT>? <JJ.*>* <NN.*>+ will chunk sequences starting with an optional determiner, <DT>?, followed by zero or more adjectives, <JJ.*>*, followed by one or more nouns, <NN.*>+. Several rules like this are defined, and the chunk parser searches the text for every chunk it can find, without overlapping with others.
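As a rough sketch of this first method, the matcher below hand-implements the semantics of the pattern <DT>? <JJ.*>* <NN.*>+ over a POS-tagged sentence; toolkits such as NLTK accept this pattern syntax directly through their regular-expression chunkers, so this is only an illustration of what such a rule does.

```python
def np_chunks(tagged):
    """Find NP chunks matching <DT>? <JJ.*>* <NN.*>+ over (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:      # at least one noun: emit the chunk
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:          # no noun reached: no chunk starts here
            i += 1
    return chunks

tagged = [("The", "DT"), ("dinosaurs", "NNS"), ("became", "VBD"),
          ("extinct", "JJ"), ("because", "IN"), ("they", "PRP"),
          ("did", "VBD"), ("n't", "RB"), ("have", "VB"),
          ("a", "DT"), ("space", "NN"), ("program", "NN")]
print(np_chunks(tagged))
# → [['The', 'dinosaurs'], ['a', 'space', 'program']]
```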
The second method uses a large corpus for training a chunk parser. First, a corpus is labelled using a scheme like the IOB format, where each token receives one of three tags: B if it marks the beginning of a chunk, I if the token is inside a chunk, and O if it is outside any chunk. Then, the parser goes through the corpus and tries to learn as many tag patterns as possible, to compare them with the sentences it is working with [14]. The learned patterns rely on lexical items and/or POS tags as features.

² http://ww2.lingsoft.fi/doc/engcg/
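The IOB scheme can be illustrated with a small helper that converts chunk spans into B/I/O labels (a hypothetical utility written for this example; spans follow Python slice conventions):

```python
def to_iob(tagged, chunks):
    """Label each (word, tag) token with B/I/O, given chunk spans (start, end)."""
    labels = ["O"] * len(tagged)
    for start, end in chunks:
        labels[start] = "B"                    # first token of the chunk
        for i in range(start + 1, end):
            labels[i] = "I"                    # remaining tokens of the chunk
    return list(zip([w for w, _ in tagged], labels))

tagged = [("The", "DT"), ("dinosaurs", "NNS"), ("became", "VBD"),
          ("extinct", "JJ")]
print(to_iob(tagged, [(0, 2)]))
# → [('The', 'B'), ('dinosaurs', 'I'), ('became', 'O'), ('extinct', 'O')]
```

A trained chunker performs the inverse mapping: it predicts these labels for unseen sentences and reconstructs the chunk spans from them.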
2.2.4 Dependency Parsing
Dependency parsing is a method for deconstructing sentences into a graph of dependencies
between words, based on a dependency grammar. Dependency grammars are a form of syn-
tactic representation, i.e. a set of rules and criteria that define binary relations between two
words, in the form of dependencies. Dependency relations have a head and a dependent,
which means that there is always a word that has a dependency to another, and this relation
is asymmetric. These relations are achieved by an operation that takes the POS tags from the
two words and returns a more elaborate arrangement of both. The resultant relation is then
labelled by a tag that defines it, and imposes linguistic restrictions on the linked words [15].
Figure 2.1: Dependency parse for the sentence The dinosaurs became extinct because they didn't have a space program. Words are labeled with the respective POS tag and arcs are labeled with the name of the relation. det: Determiner; nsubj: Nominal Subject; xcomp: Open clausal complement; mark: Marker; aux: Auxiliary; neg: Negation Modifier; advcl: Adverbial clause modifier; dobj: Direct object; compound or nn: Noun compound modifier.
Figure 2.1 shows an example of the resulting dependency graph after parsing the sentence
The dinosaurs became extinct because they didn’t have a space program. For each arc, the
word on top is the head, the lower word is the dependent, and the tag in the arc is the name
of the relation. For instance, it is possible to see that the word dinosaurs has a det relation
with the word the. This relation is named determiner and dictates that a determiner (the) must
accompany a noun (dinosaurs) [16].
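A dependency parse becomes easy to manipulate once it is represented as head/dependent pairs. The sketch below encodes the parse of Figure 2.1 as (head, label, dependent) triples and queries it; word indices are omitted for brevity, which is only safe here because no word form repeats in the sentence.

```python
# The dependency parse of Figure 2.1 as (head, label, dependent) triples.
PARSE = [
    ("became", "nsubj", "dinosaurs"),
    ("dinosaurs", "det", "The"),
    ("became", "xcomp", "extinct"),
    ("became", "advcl", "have"),
    ("have", "mark", "because"),
    ("have", "nsubj", "they"),
    ("have", "aux", "did"),
    ("have", "neg", "n't"),
    ("have", "dobj", "program"),
    ("program", "det", "a"),
    ("program", "compound", "space"),
]

def dependents(head, label=None):
    """All dependents of `head`, optionally filtered by relation label."""
    return [d for h, l, d in PARSE if h == head and (label is None or l == label)]

print(dependents("became", "nsubj"))  # subject of the main verb
# → ['dinosaurs']
print(dependents("have"))
# → ['because', 'they', 'did', "n't", 'program']
```

Queries of exactly this kind (find the nsubj of a verb, follow a dobj arc) are what the Open IE tools of Section 2.4 perform over much larger parses.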
There are two main classes of methods for parsing a sentence according to syntactic dependencies: grammar-driven and data-driven. In both, the input is always a sentence and the output is a dependency graph [17].
Grammar-driven methods for dependency parsing rely only on a well-defined formal grammar, which implies that the parser will be more restrictive. Two possible approaches for parsing sentences with grammar-driven methods are: (i) conversion to a context-free grammar, in which dependencies are represented as production rules, and (ii) projection to a constraint satisfaction problem, where the analysis is restricted by a set of constraints present in the grammar.
On the other hand, data-driven methods aim to learn a good predictor of dependency trees based on supervised training, which builds a less robust but more general model. Within data-driven methods, there are two main approaches: (i) transition systems, which at each step of building the dependency graph try to find the highest-scoring dependency in the sequence, and (ii) graph-based systems which, for a given sentence, compute all possible trees according to the model and choose the one with the highest total score [18].
2.2.5 Semantic Role Labeling
The semantic analysis of text refers to the characterization of events that answer questions like
who did what to whom, where, when and how. Semantic Role Labelling (SRL) is the task of
identifying the relations of a target verb to its corresponding participants, i.e., for every verb
there is a pre-specified list of possible semantic roles, and the task of SRL tries to assign each
role to some noun-phrase element in the sentence [19].
A semantic role is a class in linguistic theory, and the constituents of each class vary according to the author. Roles range from extremely specific and verb-dependent to extremely general. For instance, in a more specific definition, the verb attack can have the semantic roles of attacker and victim, and the verb operate can have the roles of doctor and patient, while in a more general definition both verbs instead have the roles of agent and patient [20]. There are three widely accepted definitions of semantic classes, associated with the projects FrameNet [21], PropBank [22], and VerbNet [23].
A typical SRL system is composed of several components, in a modular architecture. First, it has a syntactic parser that is responsible for classifying phrases in the sentence into syntactic categories. Then, it uses a predicate identification and classification component to search for the existing predicates. In Figure 2.2, the predicates are became, did, and have. Next, an argument identification component is used to find the arguments for each predicate. In a last step, the classification of the arguments is done separately from the previous process, using an argument classification component. An implementation of an SRL system following this model is discussed in [22] and [24].
Although dependency parsing and semantic role labelling differ in the structures they map to (dependency parsers map sentences to syntactic dependency graphs, while semantic roles focus on individual predicates), the results are partially overlapping. For instance, the syntactic dependency between two words may coincide with their semantic roles; the similarities are visible in Figures 2.1 and 2.2. This means that it may be possible to use both methods together to improve the accuracy of a classification system that attempts to interpret sentences for information extraction [25].
Figure 2.2: Semantic role labelling of the sentence The dinosaurs became extinct because they didn't have a space program. Words are marked with their respective POS tag, and arcs point to the grammatical function the node has to the head. SBJ: Subject; NMOD: Nominal modification; PRD: Second predicate; PRP: Purposer; SUB: Subordinate; ADV: Adverbial modification; VC: Verb Complement; OBJ: Object.
2.3 Stanford CoreNLP
Stanford CoreNLP is a suite of natural language processing and analysis tools [26]. Among others, it has an annotator for each of the concepts described above. The software also works as an integrated framework: for any desired process, it receives an input text and applies the necessary annotators, layer by layer, to generate the output. For instance, to run the dependency parser, Stanford CoreNLP calls the annotator tokenize, to divide the text into tokens, followed by the annotator ssplit, which splits the text into sentences, and the annotator pos, responsible for applying POS tags, before finally executing the dependency parser annotator.

The dependency parser in particular will be used in Chapter 4. It is a data-driven, transition-based parser, trained using neural networks [27], and it uses the set of dependencies from the Universal Dependencies project [28].
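A common way to use CoreNLP from other languages is through its HTTP server, which accepts the annotator list as a properties parameter and returns JSON. The sketch below assumes a server is already running locally on port 9000 (an assumption for illustration, not part of this work's setup); the subjects helper is then exercised on a trimmed, hand-written response of the shape the server returns, so nothing here requires the server itself.

```python
import json
import urllib.parse
import urllib.request

def annotate(text, annotators="tokenize,ssplit,pos,depparse",
             url="http://localhost:9000"):
    """Send `text` to a locally running CoreNLP server; return parsed JSON."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    req = urllib.request.Request(
        url + "/?properties=" + urllib.parse.quote(props),
        data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def subjects(doc):
    """Extract (verb, subject) pairs from a CoreNLP JSON document."""
    pairs = []
    for sentence in doc.get("sentences", []):
        for dep in sentence.get("basicDependencies", []):
            if dep["dep"] == "nsubj":
                pairs.append((dep["governorGloss"], dep["dependentGloss"]))
    return pairs

# Trimmed, hand-written response of the shape CoreNLP returns, used here
# so the helper can be exercised without a running server.
sample = {"sentences": [{"basicDependencies": [
    {"dep": "nsubj", "governorGloss": "became", "dependentGloss": "dinosaurs"},
    {"dep": "det", "governorGloss": "dinosaurs", "dependentGloss": "The"}]}]}
print(subjects(sample))
# → [('became', 'dinosaurs')]
```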
2.4 Open Information Extraction Tools
The following section describes and compares four Open IE tools, presented chronologically: ReVerb, OLLIE, OpenIE 4, and Stanford OIE.
2.4.1 ReVerb
ReVerb [4] is an extractor that treats relation extraction as a constraint satisfaction problem. Previous extractors were based on classifiers, and produced two frequent types of errors, namely incoherent extractions and uninformative extractions. Incoherent extractions are extractions that have no meaningful interpretation. For instance, the sentence "Genghis Khan's conquest was central to the connection between Asian and European cultures" yields the extraction (was, central, connection), which is incoherent. Uninformative extractions are extractions that omit critical information. The sentence "Joana gave birth to a boy" produces the extraction (Joana, gave, birth to), which is uninformative.
The motivation for solving these errors arises from two intuitive observations: first, a great percentage of the extracted relations follow a simple pattern, meaning that it may be possible to generalize a set of rules for searching a corpus; second, there are two main kinds of extraction errors that occur frequently and may be solved by using constraints.
ReVerb imposes two constraints on the results. The first, a syntactic constraint, defines
a regular expression pattern of POS tags, shown in Table 2.3. Any potential relation must
match one of the patterns. If a relation matches more than one pattern, the longest match
is chosen.

V | VP | VW*P
V = verb particle
W = (noun | adj | adv | pron | det)
P = (prep | particle)

Table 2.3: Patterns used by ReVerb to search for relations. Every relation must match one of the rules. ReVerb searches for the longest match it can find, i.e., it gives priority to the rule VP over the rule V.

This method may sometimes extract overly specific relations that may not
be useful. To solve this problem, ReVerb uses a second, lexical constraint to evaluate whether
an extracted relation is capable of taking other arguments. This is based on the intuition
that a valid relation can take several distinct arguments in a large corpus. To do this, a dictionary
is constructed in an offline step, from a large set of possible relation phrases. The relation being
evaluated is then matched against the dictionary and is only extracted if it exists there.
The dictionary is built by applying the patterns in Table 2.3 to a corpus. The authors then
compute, for each relation phrase, the number of distinct arguments it takes, and only those
with a value above a predetermined threshold are added to the dictionary.
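The syntactic constraint can be approximated by mapping each token's POS tag to one of the class symbols of Table 2.3 and matching a greedy regular expression over the resulting string, so the longest phrase wins. The tag-to-class mapping below is a simplified assumption for illustration, not ReVerb's exact table.

```python
import re

# Simplified mapping from Penn Treebank POS tags to Table 2.3 symbols.
TAG_CLASS = {
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "NN": "W", "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "TO": "P", "RP": "P",
}

def longest_relation(tags):
    """Return the (start, end) token span of the longest relation phrase,
    or None when no tag sequence matches V | VP | VW*P."""
    symbols = "".join(TAG_CLASS.get(t, "X") for t in tags)
    best = None
    # V(?:W*P)? covers all three alternatives; greedy matching
    # prefers the longest span at each position.
    for m in re.finditer(r"V(?:W*P)?", symbols):
        if best is None or len(m.group()) > len(best.group()):
            best = m
    return (best.start(), best.end()) if best else None

# "Joana gave birth to a boy": VBD NN TO covers "gave birth to"
print(longest_relation(["NNP", "VBD", "NN", "TO", "DT", "NN"]))  # (1, 4)
```

Note how the VW*P alternative is what lets the phrase keep "birth to", avoiding the uninformative extraction (Joana, gave, birth to) discussed above.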
To extract relations, ReVerb first POS-tags and NP-chunks the corpus being evaluated.
Then, it uses the above constraints to find and validate relation phrases. If two relations are
contiguous, they are merged. Next, it searches for the relation's arguments: it is necessary to
find the left and the right arguments, and the boundaries of each. ReVerb uses heuristics
based on POS tags to determine them. If a valid relation and both arguments are found, the
system returns an extraction. In a final step, ReVerb uses a confidence function, based on a
logistic regression learner, to assign a confidence value to the extraction.
2.4.2 OLLIE
OLLIE [5] is an evolution of ReVerb, addressing its main flaw: the system can only detect
relations mediated by verb patterns. OLLIE also recognizes that some relations are not
factual (e.g., I think that Portugal is a Spanish city), but are instead based on beliefs,
hypothetical contexts, etc.
To solve these limitations, OLLIE expands the syntactic scope of possible relation phrases,
to cover a much larger number of relation expressions. It changes the Open IE representation
to allow additional context information, such as clausal modifiers.
OLLIE’s algorithm is divided into a learning phase and an extraction phase. To learn which
relation patterns are used, OLLIE bootstraps a large quantity of seed tuples from ReVerb,
under the assumption that almost all relations can be expressed in a verb-like way. For instance,
the sentence “The actor John Malkovich (...)” produces the relation (John Malkovich; is; an
actor). Then, OLLIE uses the seed tuples to learn patterns over dependency parses; these
patterns often appear in a verb-mediated form, although that is not the case in this particular
example. It then learns open pattern templates, i.e., mappings from a dependency path to an
open extraction, identifying both the arguments and the exact ReVerb-style relation phrase.
These patterns are ranked by frequency. At extraction time, OLLIE simply builds a dependency
parse for the sentence and expands the nodes as far as possible to match the previously
learned patterns. In the extraction process, there is an additional step that analyzes the context of
the relations, to solve the problem of non-factual relations. OLLIE uses the dependency parse
structure to analyze key nodes. Words like believe, imagine, or if are flagged and analyzed
and, when appropriate, OLLIE attaches a clausal modifier to make the extraction factual.
When OLLIE cannot establish the veracity of the sentence in this way, it simply assigns the
extraction a low confidence.
2.4.3 OpenIE 4
The last iteration of the KnowItAll Project (to which ReVerb and OLLIE belong) is OpenIE 4 [6].
The biggest change introduced by this system was to break with the old model of binary relations,
enabling n-ary extractions from a sentence. N-ary extractions can have one or more second
arguments. For instance, the sentence “The U.S. president Barack Obama gave his speech on
Tuesday to thousands of people” can be converted into the extraction (Barack Obama, gave,
[his speech, on Tuesday, to thousands of people]).
OpenIE 4 works by using two different main components: SRLIE and RelNoun. The first
component, SRLIE, is responsible for identifying n-ary extractions. It builds extractions
from Semantic Role Labelling (SRL) frames. To do this, it runs each sentence through a
dependency parser, then feeds the resulting dependency graph to an SRL system to
produce the SRL frames. Finally, SRLIE turns these frames into n-ary extractions:
first it heuristically filters out some SRL frames, then it determines the arguments'
boundaries, and finally it constructs the relation phrases. The second component,
RelNoun, is used to extract relations from noun phrases, like Queen vocalist Freddie Mercury.
OpenIE 4 also further improves OLLIE's clausal modifiers, taking into account conditional
sentences, and checking whether a sentence makes a negative or positive assertion.
2.4.4 Stanford Open Information Extraction
The Stanford OIE system [7], published in 2015, is the most recent of all the tools analyzed
in this section. This system differs from the others in its methodology: instead of trying
to learn increasingly complex patterns to recognize increasingly complex relations, its main
focus is to first simplify the sentences, so that in the end only a small number of patterns is
needed to recognize most of the relations. In a
first step, it preprocesses the sentences to produce smaller, independent and coherent clauses
(i.e., clauses that can stand on their own syntactically and semantically), that are entailed by the
main sentence. To accomplish this, the system traverses a dependency parsing tree, labelling
each arc with one of three actions that indicate the steps that must be taken for the arc and
correspondent subtree. These actions can be:
• Stop: if the subtree below the current arc is not entailed by the parent sentence (the
common action for leaf nodes, for instance).
• Yield: if the system found a new clause that is entailed by the parent sentence.
• Recurse: indicates that the subtree may contain a clause, but one has not been found
yet.
Additionally, sometimes there is a relation between the parent and the child clause (only
when the action is yield or recurse), and three additional actions can be taken to express
that relation:
• Subject Controller: if the current arc is not already a subject arc, the subject of the
parent node is copied and attached as the subject of the child node.
• Object Controller: does the same as the previous action but takes the parent node as
the object instead.
• Parent Subject: if the current arc is the only outgoing arc from a node, the parent node
is assigned as the passive subject of the child.
Using this algorithm, the system trains a multinomial logistic regression classifier on a
labeled dataset, learning the sequences of actions that produce relations by applying the
algorithm to that dataset. Action sequences whose extracted relations match the labeled
relations are used as positive examples, and sequences that correctly produce no
relations as negative examples.
In a second step, the system tries to use the produced clauses (after running the classifier
on the corpus under evaluation) to generate a maximally compact sentence that retains the core
semantics of the original, i.e., the smallest possible sentence that is entailed by the
parent sentence and is semantically and syntactically correct. To do this, the system uses
rules of Natural Logic described in [29] to check what it can delete from the clauses. This
is done by checking if certain words induce downward or upward polarity (words such as all,
no, many, etc.). If a word like all induces a more general scope for the sentence (e.g., if All
dogs bark, then it is true that All small dogs bark), the word all is said to induce
downward polarity. If, on the other hand, a word like some induces a more specific scope (e.g.,
if Some yellow flowers smell good, then it is possible to say that Some flowers smell good),
the word some is said to induce upward polarity. The algorithm searches for these
words, classifies each arc according to whether deleting the child of the arc makes the parent
more general, more specific, or neither, and acts accordingly, using constraints and heuristics
to shorten the sentences that can be shortened.
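A toy sketch of this deletion check, using a tiny, hand-picked quantifier lexicon (an assumption for illustration, not the actual lexicon of [29]): deleting a modifier generalizes the clause, and that is truth-preserving only under upward polarity.

```python
# Assumed toy lexicon: monotonicity of a few quantifiers in their
# restrictor argument (illustrative only).
POLARITY = {"all": "downward", "every": "downward", "no": "downward",
            "some": "upward", "many": "upward"}

def deletion_is_safe(quantifier):
    """Deleting a modifier under the quantifier makes the clause more
    general; this preserves truth only under upward polarity
    ("Some yellow flowers smell good" entails "Some flowers smell good",
    but "All small dogs bark" does not entail "All dogs bark")."""
    return POLARITY.get(quantifier.lower()) == "upward"

print(deletion_is_safe("some"), deletion_is_safe("all"))  # True False
```

The real system applies this check per dependency arc rather than per quantifier word, but the monotonicity logic is the same.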
Finally, in a third step, the system uses six main patterns to extract relations from the sen-
tences. If the system detects a compound noun (i.e., a noun phrase that may hold a relation),
it uses eight more patterns to try to extract that relation [7].
Chapter 3
Introductory Analysis
This chapter presents an introductory study of the four Open IE tools described in the
previous chapter, and an analysis of transcribed dialogue text. OpenIE 4 and Stanford OIE are
the state-of-the-art extractors; ReVerb and OLLIE are older systems. Nevertheless, there are
critical differences in the algorithms used by each system. Considering that there is still a lack
of studies of Open IE applied to dialogues, it is possible that those critical differences have
an impact on the extractions, hence our decision to compare the four tools.
The chapter starts with the study and construction of the dataset that we will use. Section
3.2 follows with an explanation of the evaluation method and a description of the metrics used.
Section 3.3 details the context in which each tool was evaluated, and establishes a benchmark
for further studies by showing the results of the four tools when applied to our dataset. In
Section 3.4, we analyze the problems observed in the initial study. Finally, in Section 3.5,
we present the characteristics of the system we propose to address the problems encountered.
3.1 Dataset Study
The dataset used to carry out this study is the Switchboard Dialogue Act Corpus [30]. This
dataset contains a collection of 1157 two-sided telephone conversations, among 543 speakers.
The conversations were transcribed and annotated; the documents carry tags that identify
speech acts and phenomena such as sounds, disfluencies, and repetitions.
To perform our studies, we chose 8 conversations at random. Open IE systems usually work
at a sentence level, i.e., they analyze one sentence at a time in search for relations. In order
to comply with that methodology, each file was pre-processed to remove major annotations,
and to be structured by sentence. This process generated 500 proper sentences. Chapter 4
explains the pre-processing algorithm in further detail.
We manually labeled the sentences by creating a structure in which, for each sentence, we
identified the possible relations and, for each relation, separately tagged the first argument,
the relation phrase, and the second argument. Some sentences did not contain information
from which to produce an extraction, e.g., “No way, John!”. These sentences were kept, as
they are part of the dataset, and were annotated with no relations. In this study we also did not
consider questions, because none of the four Open IE tools is prepared to process the
grammatical structure of a question. To execute our tests, we provided the unlabeled set of
500 sentences to each extractor, and compared the results with the labeled set.
3.2 Evaluation Method
To properly conduct a comparative experiment between all four Open IE applications, it is
necessary to understand what kind of information they produce.
Figure 3.1 shows an example of an extracted relation and its confidence value. For an
extraction to be valid, it must have both arguments of the relation, the relation phrase, and a
confidence value. It is possible, and in fact frequent, to have more than one extraction per
sentence, each with its own confidence value.
0.876 (I ; have ; two dogs)
Figure 3.1: Example of an extraction taken from the sentence “I have two dogs”, with its confidence value. The first and second arguments are, respectively, I and two dogs, and the relation phrase is have.
The confidence value is used to validate the extracted relation in unsupervised tests. It
varies between 0 and 1 and is computed by a classifier. The classifier is specific to each
application and is trained with a set of features related to the method each tool uses. For
instance, since ReVerb relies on syntactic patterns to recognize relations, it uses features
such as the number of words and the presence of specific conjunctions, while Stanford OIE
takes features from the dependency tree it uses to perform the extraction, such as the labels
of the edges taken and the POS tags of the edge endpoints.
We classified each relation extracted either as correct or incorrect. A relation is considered
correct if it satisfies the following criteria:
• It must be a valid extraction, according to the definition presented previously;
• It must be syntactically and semantically correct;
• It must be informative, i.e., it must contain all the necessary information related to the
event extracted. For example, the extraction (Peter; went; to look), taken from the sentence
“Peter went to look for a house”, is not informative and is thus classified as incorrect,
whereas the extraction (Peter; went to look; for a house) is classified as correct;
• It must be locally factual, i.e., it must not contradict or change the information present in
the sentence. For instance, if the sentence “The moon is a star ” produces the extraction
(The moon; is; a star), this extraction is correct, according to the information present in
the sentence. However, the extraction (The earth; was; flat), taken from the sentence
“Ancient cultures believed that the earth was flat”, is not correct, since that is not the
information stated in the sentence.
If an extraction does not comply with all the above criteria, it is classified as incorrect.
There may be several correct extractions for the same relation, differing only by one or two
words. We also counted the number of possible relations for which no correct extraction was
produced.
For a classification task such as this, there are four kinds of possible results:
• True Positives (TP): extractions that were classified as correct.
• True Negatives (TN): the absence of extractions when none is expected. We do not
count those results here.
• False Positives (FP): extractions that were classified as incorrect.
• False Negatives (FN): sentences for which there should have been at least one extraction,
but there was none.
By using these values, we computed the metrics of precision, recall, and F1-score. Precision
is the percentage of extractions that were classified as correct, out of all the extractions
obtained:

Precision = TP / (TP + FP)

Recall is the percentage of correct extractions obtained from the universe of all possible
correct extractions in the dataset:

Recall = TP / (TP + FN)
F1-score is the harmonic mean of precision and recall, measuring the overall performance
of the system:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
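The three metrics follow directly from the counts above; a small helper, shown with hypothetical counts for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positives, false
    positives, and false negatives (true negatives are not counted
    in this study)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts, for illustration only:
p, r, f = precision_recall_f1(tp=30, fp=70, fn=170)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.3 0.15 0.2
```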
The confidence value can be used in unsupervised tests to self-validate an extraction. It
is possible to establish a division point by defining a threshold, e.g., we can predict that all
extractions with confidence above 0.5 are correct, whether or not that is true. By varying this
threshold, it is possible to increase one of the metrics to the detriment of the other, e.g., by
lowering the threshold, we can increase recall at the cost of lower precision. To measure this
trade-off, we plot a Precision-Recall curve, computed from the confidence value and the
binary classification of each extraction. With this curve plotted, we can calculate the Area
Under the Curve (AUC) [31], which measures the space in the graph below the curve. The
AUC varies between 0 and 1. A higher AUC denotes a better trade-off, i.e., more extractions
are classified as correct at progressively lower confidence values.
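The curve can be sketched as follows: sort the extractions by descending confidence, lower the threshold one extraction at a time, and approximate the area with the trapezoidal rule. This is a simplification for illustration, not necessarily the exact AUC computation of [31].

```python
def pr_points(extractions, n_gold):
    """extractions: (confidence, is_correct) pairs; n_gold: number of
    correct extractions expected in the labeled data. Returns the
    (recall, precision) point obtained at each threshold."""
    ranked = sorted(extractions, key=lambda x: -x[0])
    points, tp = [], 0
    for i, (conf, correct) in enumerate(ranked, start=1):
        tp += int(correct)
        points.append((tp / n_gold, tp / i))
    return points

def auc(points):
    # Trapezoidal area under the (recall, precision) points,
    # starting from (recall=0, precision=1).
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# Hypothetical extractions, for illustration:
pts = pr_points([(0.9, True), (0.8, True), (0.6, False), (0.4, True)], n_gold=5)
print(round(auc(pts), 3))  # 0.542
```

Lowering the threshold admits the 0.6-confidence incorrect extraction before the 0.4-confidence correct one, which is exactly the precision-for-recall trade-off described above.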
3.3 Open IE: A Comparative Study
This section provides a comparative study between ReVerb, OLLIE, Stanford OIE and OpenIE
4. First, it describes the context in which the extractors were evaluated, as well as the results
they reported. Then, it applies the four tools to the test dataset, and analyzes the obtained
results, creating a ground truth for further tests.
3.3.1 Reported Results
In the original publications, both ReVerb and OLLIE were evaluated by human annotators.
ReVerb took extractions from a set of 500 sentences, and OLLIE used a set of 300 sentences.
In both cases, two human annotators classified each extraction as correct or incorrect, and
the final evaluations were taken from the extractions they agreed upon. ReVerb reported a
precision of 0.75, but no value for recall. OLLIE reported a precision of 0.75 and a recall of 0.5.
OLLIE also stated that it was able to extract 4.4 times more correct relations than ReVerb, and
that it achieved an AUC 2.7 times higher [4, 5].
Stanford OIE and OpenIE 4 took a different approach. Both systems participated in the Text
Analysis Conference: Knowledge Base Population (TAC KBP) Slot Filling Challenge 2013¹,
which offers, among other challenges, a task to fill a predefined schema of relations, extracted
from a large dataset. The relations are evaluated based on a set of query entities. OLLIE, the
state-of-the-art system at that time, was also used in the challenge, for comparison. The results
from the TAC KBP Challenge are presented in Table 3.1 [7, 6].

¹ http://tac.nist.gov/2013/KBP/
              Precision   Recall
OLLIE           0.577     0.118
Stanford OIE    0.586     0.186
OpenIE 4        0.698     0.114
Table 3.1: Precision and Recall values reported at the TAC KBP Slot Filling Challenge 2013.
From these results, we can observe that OpenIE 4 is currently the system with the highest
precision, while Stanford OIE displays the highest recall.
3.3.2 Benchmarks
The conditions and the dataset used in the TAC KBP challenge are different from our dataset,
and from the extraction of dialogue transcriptions that we intend to study. As a consequence,
the results shown in Table 3.1 are not directly comparable with our evaluation, and should be
used only as an extrinsic reference. It is therefore necessary to establish a ground truth, i.e.,
a baseline of results obtained by applying the four extractors to our dataset, before conducting
further studies.
Table 3.2 shows the precision, recall, and F1-score obtained with the first tests, along with
the number of extractions. If we take into account the results previously reported, both precision
and recall denote a low score overall.
              Extractions   Precision   Recall   F1-score
ReVerb             760        0.36       0.12      0.18
OLLIE             1030        0.37       0.17      0.23
Stanford OIE      1415        0.27       0.24      0.25
OpenIE 4          1123        0.42       0.15      0.22
Table 3.2: Initial results obtained when testing the four extractors with raw dialogue.
ReVerb is the system with the fewest extractions, and the lowest recall and F1-score. OpenIE
4 has the highest precision, yet it shows the second lowest values for recall and F1-score.
OLLIE shows balanced results compared with the other tools. Stanford OIE has the highest
recall and the highest number of relations extracted, despite showing the lowest precision. We
observed that this larger number of extractions causes Stanford OIE to have a larger percentage
of incorrect extractions, thus lowering its precision. Yet, at the same time, it produces more
correct extractions than the others, increasing its recall.
[Precision-Recall plot: ReVerb AUC = 0.081; OLLIE AUC = 0.154; Stanford OIE AUC = 0.152; OpenIE 4 AUC = 0.164]

Figure 3.2: Precision-Recall curve for every extractor, along with the Area Under the Curve (AUC).
Figure 3.2 displays a comparison between the Precision-Recall curves of the four Open IE
systems, together with the respective AUC. The curves show the values of precision
and recall as the threshold on the confidence value varies. For instance, in OLLIE's curve
(in red), all extractions with a confidence value above 0.85 were classified as correct, hence
the curve is a straight line with precision equal to 1. However, all the other extractions, with
lower confidence, were considered incorrect, hence the recall of 0.1, as only a small
percentage of all the possible extractions was considered correct. By progressively lowering
the threshold, we can see an improvement in recall in exchange for a decrease in precision.
This trade-off is observed in the evolution of the curves, with a higher curve implying a better
trade-off. We quantify the trade-off by measuring the AUC, i.e., the size of the area below
the curve.
OpenIE 4 shows the best trade-off, with an AUC of 0.164. We can also observe that the
decline of its curve is regular, denoting that there are progressively more incorrect extractions
as the confidence is reduced, instead of a sudden lack of correct extractions at a tipping point.
The other three extractors show precisely that situation. OLLIE, for instance, has a flat line until
a confidence value of approximately 0.85, declining suddenly after that point. ReVerb has a
decrease early in the curve, showing that a large number of extractions classified as incorrect
had a high confidence value. It also has the lowest AUC, less than half of OpenIE 4's.
Stanford OIE has a local minimum early in the curve. We observed that this situation is
influenced by Stanford OIE producing more extractions than the other three systems, as in this
case it was producing multiple instances of incorrect relations with high confidence.
The results presented in this section constitute the benchmark for our subsequent studies
and experimentations.
3.4 Dataset Problems
In order to understand the results obtained by each extractor, we analyzed the incorrect
extractions and studied the characteristics of the sentences from which they were generated.
A preliminary observation shows that sentence length is one of the major problems:
longer sentences generated more incorrect extractions. Likewise, a large percentage of
incorrect extractions came from sentences with a bad structure or with noise, e.g.,
sentences showing repetitions, interruptions, changes of topic mid-sentence, or grammatical
malformations.
A more thorough analysis was conducted to understand and define the kinds of problems
encountered, and to measure their frequency. From the preliminary analysis, we consider
two types of problems: noise and sentence length. 32% of the extractions classified as incorrect
had noise-related problems, while 59% originated from long sentences.
Noise
We define Noise as grammatical malformations, or expressions that add no value to the
sentence. For instance, a sentence that has a repetition is grammatically malformed. On the
other hand, in the sentence “We went to the store and stuff”, the expression and stuff adds
no value to the information present in the sentence. Table 3.3 summarizes the most frequent
kinds of noise observed in the extractions with noise related errors. It is possible to see that
repetitions occur frequently, being present in 43% of those incorrect extractions.
Regarding interruptions, we noticed that they fall into two cases. The first is an
interruption consisting of a commentary, or an agreement sound, by one speaker while
the other is talking (e.g., I see, hm-hmm). These interruptions usually do not shift the theme
of the interrupted sentence, i.e., the first speaker continues to talk about the original subject.
The second kind of interruption consists of a shift of subject, e.g., “(...) and then we went to the
boat and... Hey, look at the time! We must be going“. While the first case may provide a
complete sentence with sufficient information for an extraction, the second case leaves an
incomplete sentence, from which it usually will not be possible to retrieve a relation.

Noise                     Frequency   Example
Repetitions                  43%      ”and then, then he...”
Meaningless expressions      28%      ”and stuff like that”
Interruptions                16%      ”By the time we...” / ”What about your car?”
Others                       13%      ”We are, you see, in spite of.”

Table 3.3: Frequency of different kinds of noise in the incorrect extractions with noise-related problems.
Long Sentences
We observed that the long sentences originating incorrect extractions were sentences
with more than one clause. Grammatically, those sentences fall into two groups:
• Compound Sentences: defined as having two independent clauses, i.e., clauses that
can stand alone syntactically (e.g., “Peter went to the mall and Maria stayed home.”)
• Complex Sentences: defined as having an independent clause and at least one dependent
clause, i.e., a clause that has no meaning if stated alone (e.g., “Peter went to the mall in the
afternoon.”)
Compound sentences are connected by a coordinating conjunction (e.g., and, for, so, yet).
Complex sentences are usually connected by a subordinating conjunction (e.g., when, because,
although, after ). Table 3.4 summarizes the most frequent kinds of subordination found in the
incorrect extractions caused by long sentences. The subordination type can be recognized by
the conjunction, marked in bold in the table.
Subordinate Type   Frequency   Example
Temporal              39%      ”I went home when the rain stopped.”
Causal                27%      ”My mother went on vacations because she needed a rest.“
Conditional            9%      ”Maria will be on time if there’s no traffic.”
Comparative            5%      ”She earns as much money as I do”
Others                13%      ”Even though he studied a lot, he didn’t pass the exam.”

Table 3.4: Frequency and examples of different kinds of subordinations in incorrect extractions originated from long sentences. Conjunctions are marked in bold.
It is possible to observe a clear prevalence of two types of subordination: Temporal
and Causal. The Others category includes types with lower frequency, and types that are
ambiguous or hard to identify.
Compound sentences commonly have two relations in them. In the example “Peter went to
the mall and Maria stayed home.“, we can extract the relations (Peter; went to; the mall) and
(Maria; stayed; home). Complex sentences mostly only have one event, and thus one relation.
In the example “Peter went to the mall in the afternoon.”, the relation (Peter; went to; the mall)
is still correct. However, the relation (Peter; went to; the mall in the afternoon) is also correct,
as it does not contradict any of our criteria. This ambiguity in the definition of the arguments is
one of the causes of errors in the extractions.
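The two-relation structure of compound sentences suggests a simple split at the coordinating conjunction; the sketch below illustrates the idea under the naive assumption of a single top-level conjunction and no conjunctions inside the clauses.

```python
# Coordinating conjunctions used as naive split points (an assumed,
# incomplete list for illustration).
COORDINATORS = (" and ", " but ", " so ", " yet ")

def split_compound(sentence):
    """Split a compound sentence into two subsentences at the first
    top-level coordinating conjunction found."""
    for conj in COORDINATORS:
        if conj in sentence:
            left, right = sentence.split(conj, 1)
            right = right[0].upper() + right[1:]
            if not left.endswith("."):
                left += "."
            return [left, right]
    return [sentence]

print(split_compound("Peter went to the mall and Maria stayed home."))
# ['Peter went to the mall.', 'Maria stayed home.']
```

Each resulting subsentence can then be handed to an extractor on its own, which is exactly the simplification strategy pursued in Section 3.5.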
Other Problems
Finally, within the remaining 9% of incorrect extractions, a frequent occurrence was related
to local factual validity, i.e., extractions where the generated relation does not correspond
to the truth present in the sentence (as seen in the example “Ancient cultures believed that
the earth was flat”). In this case, the sentence contains an expression that we call a Belief
Modifier, which changes the validity of the event in the sentence. In the previous example,
the belief modifier is “Ancient cultures believed”.
To summarize this study, by analyzing the extractions classified as incorrect, we identified the
following problems in the respective sentences:
• Sentences with noise;
• Long sentences;
• Contradictions between the relation and the sentence.
3.5 Strategy Definition
Based on the previous study, it is now necessary to develop a strategy to solve those
problems.
In order to decrease the noise, our strategy will be to remove the occurrences of repetitions
and meaningless expressions. As for interruptions, we will treat only the first case, where the
theme of the sentence continues after the interruption, and remove the interrupting expression.
The second case is usually not possible to solve, as the information needed to complete the
interrupted sentence does not exist. With regard to errors in long sentences, our strategy will
be to simplify those instances.
In the case of compound sentences, we will split the clauses and create two new subsen-
tences, considering that each clause is independent and grammatically correct by itself.
As for complex sentences, we will isolate the main event of the sentence and retrieve the
complementary information given by the dependent clauses. This information may nonetheless
be important to the relation, and even necessary to validate it. Therefore, to keep it, we will
implement a structure capable of storing both the relation and the complementary information.
The rationale is that, by removing these dependent clauses from the sentences beforehand,
the resulting simplified sentences will be easier for the extractors to process. We will focus
our implementation on the three most frequent types of clauses observed in the study above:
Temporal, Causal, and Conditional. Conveniently, this solution also addresses the third
problem, corresponding to Belief Modifiers.
This strategy is based on mechanisms that some of the Open IE tools already employ:
Stanford OIE extracts its relations by first simplifying the sentences, while OLLIE and
OpenIE 4 extract some complementary data alongside the relation.
Chapter 4
Implementation
This chapter describes the implementation of the system proposed to improve the results of
all four tools, based on the study presented in the previous chapter. The chapter starts with an
explanation of the objectives of the application in Section 4.1, followed by the definition of the
input requisites of the system and of the output format in Section 4.2. Sections
4.3 to 4.5 present a detailed description of every module implemented.
4.1 Overview
As studied in the previous chapter, the major problems observed in our dialogue dataset are
noise, sentence length, and contradictions between the relations and the sentences. In Section
3.5 we defined that the strategy to solve these problems is to remove the noise; to simplify the
sentences; and to store complementary information into an organized structure. As such, our
system is divided into the following five modules:
• Pre-Processor: tasked with adapting the dataset to comply with the input requisites of
the system. It is external to the main pipeline and should be changed in case another
dataset is to be evaluated;
• Sentence Cleaner: used to clean the input data from noise and irrelevant information;
• Sentence Divider: concerned with dividing sentences into subsentences, and with simplify-
ing them by retrieving complementary information;
• Relation Extractor: responsible for setting and calling the four extractors used in the final
step of the pipeline;
• Output Structure Organizer: a module whose task is to store the information generated
in the output of the different Open IE tools into an XML structure.
The system, shown in Figure 4.1, works as follows: a pre-processed file is provided to the
software. The data passes through the modules sequentially, first cleaning each sentence, then
dividing and simplifying it, and finally extracting its relations. The output organizer works in
parallel, structuring the information as it is processed. The output data is the XML structure
produced by the Output Structure Organizer.
Figure 4.1: Architecture of our application. The three main modules process the data in sequence, while the Output Structure Organizer module works in parallel with them.
4.2 Input Requirements and Output
Before describing the implementation of our system, it is important to define what data can be
used as input, and how the output should be formatted. This section defines the requirements and
the output of our application.
4.2.1 Input Requirements
Every Open IE tool analyzed operates at the sentence level, which means that all
four Open IE tools extract relations for one sentence at a time.
In this context, a sentence is defined as a set of one or more words, from a single speaker,
written with an initial capital letter and ending punctuation. No characters or
annotations outside of the context of the transcription are allowed (e.g., an exclamation mark
is part of the context of the speech, but a pair of brackets indicating repetition is not).
Therefore, for data to be accepted by this system, it must be arranged in a sentence-based
structure, i.e., each input file must have a single sentence per line. The subsequent description
of the system is done at the sentence level.
Switchboard is an annotated transcription of dialogues, which means that it is labeled with
meta-information used to indicate details such as utterance indexes or existing disfluencies.
These labels are not necessary for our application, hence they were removed.
The next step is to normalize the sentences, by capitalizing every first word and adding a
full stop where final punctuation is missing. This step is important because some
extractors use these tokens to aid the recognition process. Nevertheless, we only execute this
step at this point in tests that do not use the Sentence Cleaner module. Otherwise, we normalize
the sentences only after the Sentence Cleaner, because the lack of final punctuation is crucial in
recognizing interruptions.
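A minimal sketch of this normalization step (the function name is ours):

```python
import re

def normalize(sentence: str) -> str:
    """Capitalize the first word and append a full stop
    when final punctuation is missing."""
    sentence = sentence.strip()
    if not sentence:
        return sentence
    sentence = sentence[0].upper() + sentence[1:]
    if not re.search(r'[.!?]$', sentence):
        sentence += '.'
    return sentence
```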
4.2.2 Output
In order to store the complementary information that will be retrieved from the sentences before
the extraction, we created the structure shown in Figure 4.2.
<relations file="file-name">
  <sentence id="12">
    <text>Last year I thought it was a good sacrifice to make, because I wanted to spend some time with my kids.</text>
    <simplification id="0">
      <text>It was a good sacrifice to make.</text>
      <TC>Last year</TC>
      <BM>I thought</BM>
      <extraction id="0" confidence="0.756" classification="1">
        <arg1>it</arg1>
        <rel>was</rel>
        <arg2>a good sacrifice to make</arg2>
      </extraction>
    </simplification>
    <simplification id="1">
      <text>I wanted to spend some time with my kids.</text>
      <DC>Because</DC>
      <extraction id="0" confidence="0.802" classification="1">
        <arg1>I</arg1>
        <rel>wanted to spend</rel>
        <arg2>some time with my kids</arg2>
      </extraction>
    </simplification>
  </sentence>
</relations>
Figure 4.2: Example of the output XML structure for the sentence "Last year I thought it was a good sacrifice to make, because I wanted to spend some time with my kids."
This structure consists of an XML file produced for every file evaluated, for each tool used.
Each XML file will have a list of sentences. The sentence stated is the original, taken from the
pre-processed file. After a sentence goes through each module, every subsentence created
is inserted into a simplification element, and if any secondary elements are retrieved,
they are inserted in the respective entries. The four possible elements are: TC for Temporal
Context; BM for Belief Modifier; DC for Direct Consequence; and CC for Condition. Later, for
each extraction found, an extraction element is created, containing the three elements of
a relation, i.e., arg1, rel, and arg2, as well as the confidence value obtained by the tool,
and a binary classification value used for evaluation. It is possible for a simplification to
have no extractions, in which case no extraction element is created. If a sentence has no
simplifications, a simplification element is created with the text being equal to the sentence.
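The structure above can be built incrementally with the standard library, for instance as follows (a sketch; the helper function names are ours, not those of the actual implementation):

```python
import xml.etree.ElementTree as ET

def build_structure(filename: str) -> ET.Element:
    """Root element of the output structure for one evaluated file."""
    return ET.Element('relations', {'file': filename})

def add_sentence(root, sent_id, text):
    sent = ET.SubElement(root, 'sentence', {'id': str(sent_id)})
    ET.SubElement(sent, 'text').text = text
    return sent

def add_simplification(sent, simp_id, text, complements=None):
    simp = ET.SubElement(sent, 'simplification', {'id': str(simp_id)})
    ET.SubElement(simp, 'text').text = text
    # complements maps a tag (TC, BM, DC, CC) to the extracted phrase.
    for tag, phrase in (complements or {}).items():
        ET.SubElement(simp, tag).text = phrase
    return simp

def add_extraction(simp, ext_id, confidence, arg1, rel, arg2, classification='0'):
    ext = ET.SubElement(simp, 'extraction',
                        {'id': str(ext_id), 'confidence': str(confidence),
                         'classification': classification})
    ET.SubElement(ext, 'arg1').text = arg1
    ET.SubElement(ext, 'rel').text = rel
    ET.SubElement(ext, 'arg2').text = arg2
    return ext
```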
4.3 Sentence Cleaner
Sentence Cleaner is the module responsible for removing the noise. We identified noise as
repetitions, meaningless expressions, and interruptions.
Repetitions are removed with a regular expression. If the module identifies two equal words,
one after the other, with a possible punctuation mark in between, it removes one of the
occurrences.
Meaningless expressions are identified, and consequently removed, by using a dictionary.
We manually compiled this dictionary using every meaningless expression found in the dataset,
together with possible variations. This dictionary contains expressions such as "and everything",
"or something", and "that's good".
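The two cleaning steps above can be sketched as follows (the regular expression and the dictionary subset are illustrative, not the exact ones used):

```python
import re

# Illustrative subset of the manually compiled dictionary.
MEANINGLESS = ["and everything", "or something", "that's good"]

# A word, an optional punctuation mark, whitespace, then the same word again.
_REPEAT = re.compile(r'\b(\w+)\b[,.!?]?\s+\1\b', flags=re.IGNORECASE)

def _tidy(sentence: str) -> str:
    """Collapse double spaces and spaces left before punctuation."""
    sentence = re.sub(r'\s{2,}', ' ', sentence)
    return re.sub(r'\s+([,.!?])', r'\1', sentence).strip()

def remove_repetitions(sentence: str) -> str:
    """Drop one occurrence of an immediately repeated word."""
    prev = None
    while prev != sentence:            # repeat until stable
        prev = sentence
        sentence = _REPEAT.sub(r'\1', sentence)
    return sentence

def remove_meaningless(sentence: str) -> str:
    """Delete every dictionary expression found in the sentence."""
    for expr in MEANINGLESS:
        sentence = re.sub(re.escape(expr), '', sentence, flags=re.IGNORECASE)
    return _tidy(sentence)
```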
Interruptions are recognized by the absence of final punctuation in a sentence. We restate
that there are two kinds of interruptions: the case where the theme of the sentence is not
shifted, as when an agreement sound is made; and the case where the theme shifts and the
sentence becomes incomplete. We only act upon the first case. To identify the correct cases,
the module searches for lines that do not end with any final punctuation. Then, it evaluates the
next sentence. If this sentence has a size equal to, or smaller than three words, the sentence
is removed, and the interrupted sentence is joined to the next sentence in line. If the number of
words is greater than three, we consider that the interruption may have enough information to
be considered a second case, and we ignore that sentence. Figure 4.3 shows an example of
this procedure.
Before:
> Friday we went downtown
> Right.
> to dine at that new restaurant.

After:
> Friday we went downtown to dine at that new restaurant.
Figure 4.3: Example of an interruption being identified and resolved. Each > indicates a different sentence. On the left, the second speaker interrupted the first with a comment. The algorithm identifies that the second sentence has three words or fewer, so it removes it and joins the first and third sentences, creating the sentence seen on the right.
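The interruption-handling procedure illustrated in Figure 4.3 can be sketched as follows (a single-pass sketch; the function name is ours, and the three-word threshold is the one defined above):

```python
import re

def resolve_interruptions(lines):
    """Join sentences split by short interruptions (three words or fewer)."""
    result = []
    i = 0
    while i < len(lines):
        current = lines[i]
        # Missing final punctuation signals a possible interruption.
        if not re.search(r'[.!?]$', current) and i + 2 < len(lines):
            interjection = lines[i + 1]
            if len(interjection.split()) <= 3:
                # Drop the interjection and join with the continuation.
                result.append(current + ' ' + lines[i + 2])
                i += 3
                continue
        result.append(current)
        i += 1
    return result
```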
The last step in this module is to normalize the sentences, with a process similar to the one
described in Section 4.2.1. The objective is to capitalize every sentence, add final punctuation
where there is none, and correct errors that may have been created by the removal of content,
such as consecutive commas.
4.4 Sentence Divider
In Section 3.5, we defined two objectives for simplifying phrases: dividing compound sentences
into subsentences; and extracting complementary information from dependent clauses in
complex sentences.
Compound sentences have two independent clauses connected by a coordinating
conjunction, e.g., "Joseph was Christian and Patricia was Buddhist". Part of the division
algorithm is based on [32].
The complementary information to be extracted from complex sentences is:
• Temporal Context: a reference to the time the action takes place (e.g., "We'll play
football next Thursday.");
• Condition: an action necessary to validate a sentence (e.g., "My mother will go to the
beach if it doesn't rain.");
• Cause: the reason that provoked the main event of the sentence (e.g., "I don't eat broccoli
because it stinks!");
• Belief Modifier: statements that may change the truth of the sentence, according to their
context (e.g., "I don't believe that Jack will finish the master's.").
To help identify these speech components, we use the Stanford CoreNLP software [26],
specifically the POS tagger and the dependency parser. Thus, before executing the next steps,
the Stanford Dependency Parser is run on each sentence, producing the necessary POS tags
and dependencies.
Using the cleaned sentences generated in the previous module, each sentence goes through
the following steps. First, the algorithm tries to divide the sentence. Each obtained subsentence
is fed back to the first step, until the sentences cannot be divided any further. Then, the
algorithm searches for the complementary information, reducing each subsentence to its main
components.
4.4.1 Coordinate Sentences
Coordinate sentences are recognizable by a coordinating conjunction connecting both clauses.
The dependency parser labels this conjunction with the POS tag CC, and labels the dependency
as CONJ. However, a coordinating conjunction can also be present in other syntactic elements, like
enumerations (e.g., "John went to the store to buy oranges and bananas").
To divide only the correct sentences, the algorithm searches the tree for a CONJ dependency.
If the dependent is not a verb, the algorithm ignores the dependency and keeps searching
for another CONJ. If the dependent is a verb, the algorithm first checks whether the length of the
possible subsentence is larger than one word; otherwise it would not be a valid sentence. If the
length is larger than one, the algorithm generates both subsentences, dividing at the CC
node (and removing it from each sentence, adding it to the external structure). Finally, the
algorithm checks for the existence of a subject in the second subsentence, labeled by NSUBJ. If
there is none, the NSUBJ from the main sentence is used.
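A simplified sketch of this division step, assuming the parser output is available as aligned token, POS, and dependency lists (the subject-copying step is omitted for brevity, and all names are ours):

```python
def split_coordinate(tokens, deps, pos):
    """Split a compound sentence at a coordinating conjunction.

    tokens: list of words; deps: list of (head_idx, rel, dep_idx) triples;
    pos: Penn Treebank POS tags aligned with tokens.
    Returns (left, right) subsentences, or None when no valid split exists.
    """
    for head, rel, dep in deps:
        # Only conj dependencies whose dependent is a verb mark a clause split.
        if rel == 'conj' and pos[dep].startswith('VB'):
            # Find the coordinating conjunction (CC) between the two clauses.
            cc = next((i for i in range(head + 1, dep) if pos[i] == 'CC'), None)
            if cc is None:
                continue
            left = tokens[:cc]
            right = tokens[cc + 1:]
            if len(right) <= 1:      # not a valid subsentence
                continue
            return ' '.join(left), ' '.join(right)
    return None
```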
4.4.2 Causal Sentences
Causal sentences are identified by a set of causal conjunctions, such as because, so that, in
order to, and variations of those. We compiled all the causal conjunctions into a dictionary. The
algorithm uses a regular expression to search for the causal conjunctions in the sentence. If
the conjunction is found in the middle of the sentence, the clause starting at that conjunction is
extracted and added to the XML structure with the tag <DC>, for direct consequence. If the
conjunction is found at the beginning of the sentence, the algorithm searches for a comma, to
delimit both clauses. If it finds a comma, it retrieves the clause from the conjunction to
the comma. Otherwise, the clause is not retrieved.
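A sketch of this step with regular expressions (the dictionary subset and the function name are illustrative):

```python
import re

# Illustrative subset of the causal-conjunction dictionary.
CAUSAL = ['because', 'so that', 'in order to']

def extract_cause(sentence):
    """Return (simplified_sentence, causal_clause), or (sentence, None)."""
    for conj in CAUSAL:
        if sentence.lower().startswith(conj):
            # Conjunction at the beginning: the clause runs to the first comma.
            m = re.match(re.escape(conj) + r'\s+(.+?),\s*(.+)$',
                         sentence, flags=re.IGNORECASE)
            if m:
                return m.group(2), conj + ' ' + m.group(1)
            continue
        # Conjunction in the middle: the clause runs to the end of the sentence.
        m = re.search(r',?\s+' + re.escape(conj) + r'\s+(.+?)[.!?]?$',
                      sentence, flags=re.IGNORECASE)
        if m:
            return sentence[:m.start()] + '.', conj + ' ' + m.group(1)
    return sentence, None
```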
4.4.3 Temporal Context
The search for a temporal context is done in two steps, considering that it may exist under
different forms.
The first step is to use the dependency tree of the sentence. The dependency parser
labels possible temporal clauses with ADVCL, although this label is used for any kind of clause
modifier. To account for this, the algorithm verifies whether the dependent is one of a set of
temporal conjunctions (i.e., when, after, until, before, since, while, or once) that were compiled
into a dictionary. If the dependent is a temporal conjunction, the subtree with the temporal clause
is retrieved and added to the XML structure with the tag <TC>, for temporal context.
Nevertheless, the group of temporal references that is not mediated by a temporal conjunction
is not recognized as a temporal clause by the parser. To find those references, we use
a dictionary in conjunction with regular expressions that take into account different kinds of
time references (e.g., three years ago, last Saturday, tomorrow at noon, now, everyday, about
a week, etc.). The algorithm uses different combinations of patterns, such as a preposition
followed by a day of the week, or a number followed by a time quantity. If any such temporal
reference is found, it is extracted from the sentence and, again, added to the XML structure. It
is possible to have several temporal references for the same sentence.
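The dictionary-and-pattern search can be sketched as follows (the patterns below are a small illustrative subset, not the full set used):

```python
import re

DAYS = r'(monday|tuesday|wednesday|thursday|friday|saturday|sunday)'
UNITS = r'(second|minute|hour|day|week|month|year)s?'

# Illustrative combinations: preposition + weekday, number + time quantity,
# and bare temporal adverbs.
TEMPORAL_PATTERNS = [
    re.compile(r'\b(next|last|on|every)\s+' + DAYS + r'\b', re.IGNORECASE),
    re.compile(r'\b(a|\d+|one|two|three)\s+' + UNITS + r'\s+ago\b', re.IGNORECASE),
    re.compile(r'\b(yesterday|today|tomorrow|now|everyday)\b', re.IGNORECASE),
]

def extract_temporal(sentence):
    """Remove temporal references from a sentence; return (sentence, refs)."""
    refs = []
    for pattern in TEMPORAL_PATTERNS:
        for m in pattern.finditer(sentence):
            refs.append(m.group(0))
        sentence = pattern.sub('', sentence)
    # Tidy whitespace and spaces left before punctuation.
    sentence = re.sub(r'\s{2,}', ' ', sentence)
    sentence = re.sub(r'\s+([,.!?])', r'\1', sentence)
    return sentence.strip(), refs
```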
4.4.4 Condition
A condition is defined by the existence of an if word, followed by a dependent clause. To
retrieve the conditional clause, the dependency tree is used to properly identify the limits of the
condition. The algorithm searches the dependency graph for a MARK dependency, and checks
whether the dependent word is if, for the same reasons expressed in the previous step: the MARK
label is used to identify several types of finite subordinate clauses, not only conditional
clauses. If a condition is found, it is extracted from the sentence and added to the structure
under the tag <CC>.
4.4.5 Belief Modifier
Finally, the last step searches for belief modifiers, again using a dictionary in association with a
regular expression. The algorithm searches for a pattern that matches a pronoun (e.g., he,
they, etc.) followed by a conjugation of a specific verb, taken from a dictionary. The dictionary
was compiled by finding all the occurrences of a belief modifier in the dataset, and retrieving
the verb and its conjugations. It contains verbs such as think, know, say, find, and decide. It is
likely that these verbs are used in cases other than belief modifiers, therefore the algorithm
applies restrictions to the search. Among others, the modifier must be at the beginning of the
sentence, and there must be at least one other verb present.
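A sketch of this search (the verb dictionary subset is illustrative, and the "at least one other verb" restriction is approximated here by requiring a minimal remainder):

```python
import re

PRONOUNS = r'(i|you|he|she|we|they)'
# Illustrative subset of the belief-verb dictionary with conjugations.
BELIEF_VERBS = r'(think|thinks|thought|know|knows|knew|say|says|said|find|found|decide|decided)'

# Restriction: the modifier must be at the beginning of the sentence.
BELIEF_RE = re.compile(r'^' + PRONOUNS + r'\s+' + BELIEF_VERBS + r'\s+(that\s+)?',
                       flags=re.IGNORECASE)

def extract_belief_modifier(sentence):
    """Strip a sentence-initial belief modifier; return (sentence, modifier)."""
    m = BELIEF_RE.match(sentence)
    if not m:
        return sentence, None
    rest = sentence[m.end():]
    # Approximation of the "another verb present" restriction:
    # require at least two remaining words.
    if len(rest.split()) < 2:
        return sentence, None
    modifier = sentence[:m.end()].strip()
    return rest[0].upper() + rest[1:], modifier
```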
4.5 Relation Extractor
The last module in our system is responsible for extracting the relations. After each sentence
goes through the previous steps, the application prepares every subsentence generated (or
just the original sentence, if it was not simplified) and feeds them into each of the Open IE
tools, namely ReVerb, OLLIE, OpenIE 4, and Stanford OIE. Each application is configured
to generate as many extractions as it can find for a given sentence, along with the confidence
value. The extraction is retrieved in the form confidence - (argument 1; relation
phrase; argument 2). For each obtained extraction, an <extraction> element is inserted in
the XML structure along with its confidence value, and every element of the extraction populates
its respective field in the structure as well. Later, for evaluation purposes, every extraction
is manually classified as 1 if correct, or 0 if incorrect, according to the criteria defined in Section
3.2.
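Parsing the extractor output format described above into the fields stored in the XML structure can be sketched as:

```python
import re

# One line per extraction: "confidence - (arg1; rel; arg2)".
EXTRACTION_RE = re.compile(
    r'^(?P<confidence>[01]\.\d+)\s*-\s*'
    r'\((?P<arg1>[^;]*);(?P<rel>[^;]*);(?P<arg2>[^)]*)\)$')

def parse_extraction(line):
    """Parse one extractor output line into a dict, or None on mismatch."""
    m = EXTRACTION_RE.match(line.strip())
    if not m:
        return None
    return {'confidence': float(m.group('confidence')),
            'arg1': m.group('arg1').strip(),
            'rel': m.group('rel').strip(),
            'arg2': m.group('arg2').strip()}
```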
Chapter 5
Evaluation
In this chapter, we analyze the results obtained by our implementation. Section 5.1 presents
a detailed analysis of the values obtained, in comparison with the initial values. In Section 5.2
we evaluate the problems generated by our system. Finally, in Section 5.3 we discuss and
conclude the study presented in this work.
5.1 Results
This section presents the results that were obtained after processing the set of 500 sentences
with our system. Table 5.1 compares the total number of extractions, the precision, recall, and
F1-score of every extractor when running on our dataset against the values obtained for the
ground truth.
An initial observation shows that, first, the number of extractions decreased, and second,
the values improved for every extractor. The fact that the extractors are processing smaller
quantities of information in each sentence justifies the first observation. In Chapter 3 we
concluded that the extractors produced more errors with longer sentences and more noise. Our
application focused on simplifying the data, preserving the information essential to the relation
and storing complementary information outside the sentence. It follows directly that there is
less information to extract relations from, which accounts for fewer extractions. In our case, that
consequence may be positive, if the majority of the extractions removed were incorrect. The
second observation supports that idea, as every extractor improved its results.
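From the manual binary classifications, the metrics reported here can be computed as follows (a sketch; the thesis does not specify the recall denominator, so num_gold_relations is an assumption about the evaluation setup):

```python
def precision_recall_f1(classifications, num_gold_relations):
    """Compute precision, recall, and F1-score.

    classifications: binary labels (1 = correct) for every extraction produced.
    num_gold_relations: number of correct relations annotated in the dataset
    (assumed denominator for recall).
    """
    correct = sum(classifications)
    precision = correct / len(classifications) if classifications else 0.0
    recall = correct / num_gold_relations if num_gold_relations else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```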
ReVerb demonstrated the smallest improvement, with an increase of 0.02 in precision, 0.06
in recall, and 0.06 in F1-score. The heuristics used by ReVerb (described in Section 2.4) are
limited to a pattern, which restricts the range of possible relations it can extract. It was possible
to observe that reducing noise and complexity had an impact, but it was small compared with
                  Extractions      Precision       Recall        F1-score
                 Before  After   Before  After  Before  After  Before  After
ReVerb              760    545     0.36   0.38    0.12   0.18    0.18   0.24
OLLIE              1030    805     0.37   0.44    0.17   0.25    0.23   0.32
Stanford OIE       1415   1261     0.27   0.39    0.24   0.35    0.25   0.37
OpenIE 4           1190    975     0.42   0.51    0.15   0.26    0.22   0.34

Table 5.1: Number of extractions, precision, recall, and F1-score obtained for the ground truth (Before) and for processed dialogue (After).
the other three tools.
OLLIE exhibits an improvement of 0.07 in precision, 0.08 in recall, and 0.09 in F1-score.
OLLIE reveals balanced results overall, with the second best precision, and recall and F1-scores
close to those of OpenIE 4. OLLIE is also based on patterns, although they are learned instead
of defined manually. Our study showed that noise and complexity interfere with the recognition
of those patterns. Nevertheless, as the patterns are learned, it is possible that, if OLLIE is trained
with a bootstrap of seeds from dialogue text, it can adapt to better recognize dialogue patterns.
Stanford OIE shows an increase of 0.12 points in precision, 0.11 in recall, and 0.12 in
F1-score. Despite recall being high in the ground truth, compared with the other extractors,
it still improved significantly. As for precision, in our baseline we observed that, being the
tool that generates the most extractions, it consequently produced a large number of incorrect
extractions, having the lowest precision of the four. By using our system, Stanford OIE increased
its precision to 0.39, an increase of 44%. Stanford OIE searches for extractions by
applying its own method of simplification, attempting to divide the sentences until it recognizes
a relation. Sentence simplification is a process prone to errors. This study showed that a
focus on reducing noise and complexity can have a considerable impact on such a process.
OpenIE 4 shows a growth of 0.09 points in precision, 0.11 in recall, and 0.12 in F1-score. It
is the extractor with the highest precision, 0.51. OpenIE 4's main innovation was the extraction of
n-ary relations. The extractor works with the opposite idea of Stanford OIE, as it uses semantic
role labeling (SRL) to retrieve as much information as possible, using concepts similar to ours
(e.g., time context), instead of simplifying. Our application is not entirely complementary to
this process, as it replaces part of OpenIE 4's functions. This situation was observed in the
extractions, as most relations followed the traditional format, and there was only a small number
of n-ary extractions. Nevertheless, the use of SRL to extract relations proves to be effective
compared with the other extractors, and the reduction of noise and complexity in the input
sentences is able to improve that process.
[Figure 5.1 contains four precision-recall plots, one per extractor: ReVerb (Raw Dialogue AUC = 0.081, Processed Dialogue AUC = 0.092); OLLIE (Raw 0.154, Processed 0.189); Stanford OIE (Raw 0.152, Processed 0.192); OpenIE 4 (Raw 0.164, Processed 0.205).]
Figure 5.1: Comparison between the precision-recall curves of each extractor, measured for the ground truth (blue) and for text processed by our application (red).
Figure 5.1 shows the differences between the precision-recall curves of the ground truth
results and the results generated by applying our system. Every extractor shows overall
improvements. ReVerb has the smallest change, increasing only the number of correct extractions
for the top values of confidence. OLLIE demonstrates a larger improvement in precision,
especially for medium values of confidence (in the range of 0.4 to 0.7). Stanford OIE
reduces the local minimum visible in the ground truth and has a smoother curve, improving
the precision for medium values of confidence. OpenIE 4 shows the best results, confirmed
by its AUC, and presents an increase at all levels of confidence. In Figure 5.2 we can see a
comparison of the precision-recall curves for every extractor side-by-side.
[Figure 5.2 contains a single precision-recall plot with all four curves: ReVerb AUC = 0.092; OLLIE AUC = 0.189; Stanford OIE AUC = 0.192; OpenIE 4 AUC = 0.205.]
Figure 5.2: Side-by-side comparison of all four precision-recall curves, measured for the dialogue processed by our application.
5.2 Problems
In order to analyze the reliability of our system, i.e., whether the modules implemented were
successful in removing noise, and in simplifying and dividing sentences, we manually annotated
the errors found in the XML structures, and measured their frequency. We define these errors
as situations where the system did not produce the desired output, e.g., it did not divide a
sentence where it was supposed to, or it removed information wrongly identified as noise.
Table 5.2 summarizes those results.
The problems encountered with noise removal were mostly related to interruptions, in
situations where the algorithm incorrectly classified interruptions, and situations where it failed
to properly join the sentences.
The division of sentences generated errors in situations where the algorithm should have
been able to divide a sentence but failed to do so.
As for the retrieval of complementary information, the majority of problems occurred while
trying to identify the temporal context. Almost every error was due to a fault in the regular
expressions used, as they were not capable of identifying a large variety of temporal phrases.
Error                                      Frequency

Noise
  Repetitions                                   2%
  Meaningless Expressions                       1%
  Interruptions                                 7%
Sentence Division                              15%
Complementary Information
  Temporal Context                             32%
  Condition                                     0%
  Cause                                         9%
  Belief Modifier                               3%

Table 5.2: Frequency of errors encountered in the XML structures, for each task.
5.3 Discussion
Compared to our benchmark, our system improved every result for each extractor. The first
conclusion is that the problems identified, namely the noise and the size of the sentences, have
a real impact on the identification of relations, and imply that extended studies on these topics
may further improve the results.
As for the comparative study, ReVerb proved to be weaker than the other three extractors,
not justifying added efforts in its improvement.
OLLIE was not the best extractor in either metric, but demonstrated comparatively balanced
results. Due to its learning component, it is possible that if OLLIE is trained with a
bootstrap of relation seeds taken from dialogue transcriptions, its results can improve
significantly.
Stanford OIE and OpenIE 4 presented the best results in recall and precision, respectively,
with Stanford OIE achieving the best F1-score. Although their results are low compared to the
results reported for structured text, they showed a significant improvement over the ground truth.
As a final conclusion, with this study we demonstrated that it is possible to improve the
extraction of relations from dialogue transcriptions, and we recommend further studies in this
field, especially using the extractors Stanford OIE and OpenIE 4.
Chapter 6
Conclusions
Open Information Extraction (Open IE) is becoming an important process in information
retrieval and human text processing. Therefore, it is important to extend the scope of such
tools to spoken dialogue transcriptions.
As far as we know, our work provides one of the first studies of the application of Open IE
to dialogue texts. We were able to understand the limitations of Open IE tools applied to this
type of text, and provided a first attempt at a framework capable of resolving the problems that
jeopardize the generation of good results. By removing noise from the text, decreasing the size of
sentences by dividing them, and identifying complementary information to the relation that
can be stored outside the sentence, we were able to improve precision, recall, and F1-score
relative to our ground truth. Although these results are not yet ready to be useful in
real-world applications, the improvements demonstrate that there is potential in pursuing further
studies on this topic.
We hope that this work provides a reference study to the application of Open IE to spoken
dialogue transcriptions.
6.1 Contributions
The main contributions of the research reported in this dissertation are:
• A comparative and detailed study of four Open IE tools, namely ReVerb, OLLIE, Stanford
OIE and OpenIE 4, when applied to dialogue text;
• An application to be used together with an Open IE tool, which improved the quality of
extractions, showing an improvement in precision of up to 11 percentage points (pps),
in recall of up to 12 pps, and in F1-score of up to 11 pps for the best obtained results;
• The definition of a structure capable of storing complementary information to the relation,
in order to improve the extraction paradigm.
6.2 Future Work
As future work, we propose that the following points should be addressed:
• Creating a larger labeled dialogue dataset, which can be crucial to the improvement of our
system;
• Extending our framework to include types of sentences not considered in this study,
namely questions;
• Improving sentence division by taking into account more types of subordinate clauses;
• Improving information removal by addressing more types of clauses, e.g., location.
Bibliography
[1] F. Wu and D. S. Weld. Open information extraction using Wikipedia. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, pages 118–127, 2010.
[2] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the
web. Communications of the ACM, pages 68–74, 2008.
[3] T. Lin, Mausam, and O. Etzioni. Identifying functional relations in web text. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, pages 1266–
1276, 2010.
[4] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 1535–1545, 2011.
[5] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning
for information extraction. In Proceedings of the Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning, pages
523–534, 2012.
[6] S. Soderland, J. Gilmer, R. Bart, O. Etzioni, and D. Weld. Open information extraction to
KBP relations in 3 hours. In Text Analysis Conference - Knowledge Base Population, 2013.
[7] G. Angeli, M. J. Premkumar, and C. D. Manning. Leveraging linguistic structure for open
domain information extraction. In Proceedings of the Annual Meeting of the Association of
Computational Linguistics, 2015.
[8] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner:
Open information extraction on the web. In Proceedings of the Annual Conference of
the North American Chapter of the Association for Computational Linguistics: Demonstrations,
pages 25–26, 2007.
[9] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice-Hall, 2009.
[10] S. Petrov, D. Das, and R. T. McDonald. A universal part-of-speech tagset. CoRR, 2011.
[11] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics, pages 313–330, 1993.
[12] F. Karlsson, A. Voutilainen, J. Heikkila, and A. Anttila. Constraint Grammar: A Language-
independent System for Parsing Unrestricted Text, pages 165–284. Mouton de Gruyter,
1995.
[13] S. Abney. Part-of-speech tagging and partial parsing. In Corpus-Based Methods in Lan-
guage and Speech, pages 118–136, 1996.
[14] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python, chapter 7,
pages 264–277. O’Reilly Media, Inc., 2009.
[15] P. G. Otero. The meaning of syntactic dependencies. Linguistik online, 2008.
[16] M.-C. d. Marneffe and C. D. Manning. Stanford Typed Dependencies Manual. Stanford
University, 2015.
[17] J. Nivre. Dependency grammar and dependency parsing. Technical report, Vaxjo Univer-
sity, 2005.
[18] S. Kubler, R. McDonald, and J. Nivre. Dependency Parsing. Morgan and Claypool, 2009.
[19] L. Marquez, X. Carreras, K. C. Litkowski, and S. Stevenson. Semantic role labeling: An
introduction to the special issue. Computational Linguistics, pages 145–159, 2008.
[20] D. Gildea and D. Jurafsky. Automatic labeling of semantic roles. Computational Linguistics,
pages 245–288, 2002.
[21] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings
of the Annual Meeting of the Association for Computational Linguistics and of the
International Conference on Computational Linguistics, 1998.
[22] M. Ciaramita and M. Surdeanu. DeSRL: A linear-time semantic role labeling system. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing,
2008.
[23] K. K. Schuler. VerbNet: A Broad-coverage, Comprehensive Verb Lexicon. PhD thesis,
Philadelphia, PA, USA, 2005. AAI3179808.
[24] A. Bjorkelund, L. Hafdell, and P. Nugues. Multilingual semantic role labeling. In Proceed-
ings of the Conference on Computational Natural Language Learning: Shared Task, pages
43–48, 2009.
[25] R. Morante, V. V. Asch, and A. V. D. Bosch. Dependency parsing and semantic role labeling
as a single task. Proceedings of the Conference on Computational Natural Language
Learning: Shared Task, 2009.
[26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The
Stanford CoreNLP natural language processing toolkit. In Association for Computational
Linguistics System Demonstrations, pages 55–60, 2014.
[27] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 740–750, Oct. 2014.
[28] J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald,
S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. Universal dependencies
v1: A multilingual treebank collection. In Proceedings of the International Conference on
Language Resources and Evaluation (LREC 2016), May 2016.
[29] J. van Benthem. A brief history of natural logic. Technical Report PP-2008-05, University
of Amsterdam, 2008.
[30] D. Jurafsky, E. Shriberg, and D. Biasca. Switchboard SWBD-DAMSL shallow-discourse-
function annotation coders manual, draft 13, 1997.
[31] K. Boyd, K. H. Eng, and C. D. P. Jr. Area under the precision-recall curve: Point estimates
and confidence intervals. In ECML/PKDD (3), volume 8190 of Lecture Notes in Computer
Science, pages 451–466. Springer, 2013.
[32] J. C. Collados. Splitting complex sentences for natural language processing applications:
Building a simplified Spanish corpus. Procedia - Social and Behavioral Sciences, 95:464–472,
2013. ISSN 1877-0428.
[33] M. Bronzi, Z. Guo, F. Mesquita, D. Barbosa, and P. Merialdo. Automatic evaluation of rela-
tion extraction systems on large-scale. In Proceedings of the Joint Workshop on Automatic
Knowledge Base Construction and Web-scale Knowledge Extraction, pages 19–24, 2012.
[34] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld,
and A. Yates. Unsupervised named-entity extraction from the web: An experimental study.
In Artificial Intelligence, pages 91–134, 2005.
[35] J. Schmidek and D. Barbosa. Improving open relation extraction via sentence re-structuring.
In Proceedings of the International Conference on Language Resources and
Evaluation, 2014.
[36] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam. Open information
extraction: The second generation. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 3–10, 2011.
[37] M. Recasens, M.-C. de Marneffe, and C. Potts. The life and death of discourse enti-
ties: Identifying singleton mentions. In Proceedings of NAACL-HLT 2013, pages 627–633,
2013.
[38] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford’s
multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceed-
ings of the Conference on Computational Natural Language Learning: Shared Task, pages
28–34, 2011.
[39] E. Bengtson and D. Roth. Understanding the value of features for coreference resolution.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 294–303, 2008.
[40] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and
C. Manning. A multi-pass sieve for coreference resolution. In Proceedings of the Confer-
ence on Empirical Methods in Natural Language Processing, pages 492–501, 2010.
[41] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution.
In Proceedings of the Annual Meeting on Association for Computational Linguistics, 2002.
[42] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference
resolution of noun phrases. Computational Linguistics, pages 521–544, 2001.
[43] P. Kingsbury and M. Palmer. From treebank to propbank. In Language Resources and
Evaluation, 2002.
[44] S. Soderland. Learning information extraction rules for semi-structured and free text. Ma-
chine Learning, pages 233–272, 1999.
[45] E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings
of the National Conference on Artificial Intelligence, pages 1044–1049, 1996.