
Open Information Extraction
Aims to extract asserted propositions from unstructured text:

“Barack Obama, a former U.S. president, was born in Hawaii.”

1. (Barack Obama; was born in; Hawaii)
2. (a former U.S. president; was born in; Hawaii)

BIO Encoding
Each tuple is encoded with respect to a single predicate, where argument labels indicate their position in the tuple.

Barack/A0-B  Obama/A0-I  ,/O  a/A0-B  former/A0-I  U.S./A0-I  president/A0-I  ,/O  was/P-B  born/P-I  in/P-I  Hawaii/A1-B

Labels repeat when a single predicate participates in multiple propositions.
Multi-word predicates are allowed.
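As a concrete illustration, the minimal Python sketch below (function and variable names are our own, not from the released code) groups the tagged sentence above back into role spans; note that the A0 span appears twice, once per proposition.

```python
# Illustrative only: group BIO-tagged tokens back into role spans.
# "A0"/"A1" are arguments, "P" is the predicate, "O" is outside any span.
tokens = ["Barack", "Obama", ",", "a", "former", "U.S.", "president",
          ",", "was", "born", "in", "Hawaii"]
labels = ["A0-B", "A0-I", "O", "A0-B", "A0-I", "A0-I", "A0-I",
          "O", "P-B", "P-I", "P-I", "A1-B"]

def role_spans(tokens, labels):
    """Collect contiguous spans per role; a role may repeat (one per proposition)."""
    spans = []
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            continue
        role, bio = lab.rsplit("-", 1)
        if bio == "B" or not spans or spans[-1][0] != role:
            spans.append((role, [tok]))
        else:
            spans[-1][1].append(tok)
    return [(role, " ".join(ws)) for role, ws in spans]

print(role_spans(tokens, labels))
# [('A0', 'Barack Obama'), ('A0', 'a former U.S. president'),
#  ('P', 'was born in'), ('A1', 'Hawaii')]
```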

RNN-OIE: Bi-LSTM Sequence Tagger
Inspired by recent state of the art in Semantic Role Labelling (Zhou and Xu, 2015; He et al., 2017).

Features: Concatenated pretrained embeddings of the current word and the target predicate (identified by its verb POS tag).
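A minimal PyTorch sketch of this architecture is given below; the dimensions, layer count, and class/parameter names are our own assumptions rather than the released implementation (see the repository link below for the actual model).

```python
import torch
import torch.nn as nn

class RNNOIETagger(nn.Module):
    """Sketch: BiLSTM tagger over [word embedding ; target-predicate embedding]."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_labels=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # pretrained vectors would be loaded here
        self.lstm = nn.LSTM(2 * emb_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)    # per-token scores over BIO labels

    def forward(self, word_ids, pred_ids):
        # pred_ids holds the target predicate's id repeated at every position,
        # so each token is tagged with respect to that predicate.
        x = torch.cat([self.emb(word_ids), self.emb(pred_ids)], dim=-1)
        h, _ = self.lstm(x)
        return torch.log_softmax(self.out(h), dim=-1)  # (batch, seq, n_labels)
```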

Decoding: Malformed spans are ignored; an A0-I label that is not preceded by A0-B or A0-I is treated as O.
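A small sketch of this repair rule (illustrative Python, not the released decoder):

```python
def repair_labels(labels):
    """Turn any dangling 'X-I' (not preceded by 'X-B' or 'X-I') into 'O'."""
    repaired, prev = [], "O"
    for lab in labels:
        if lab.endswith("-I") and prev not in (lab[:-2] + "-B", lab[:-2] + "-I"):
            lab = "O"  # malformed continuation: drop it from any span
        repaired.append(lab)
        prev = lab
    return repaired

print(repair_labels(["O", "A0-I", "A0-B", "A0-I", "P-B", "A1-I"]))
# ['O', 'O', 'A0-B', 'A0-I', 'P-B', 'O']
```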

Confidence: Estimated for an extraction E as ∏_{w ∈ E} P(l_w), the product of the predicted label probabilities of the words in E.
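For instance (toy probabilities, not values from the paper), an extraction whose words were tagged with probabilities 0.9, 0.95, 0.8, and 0.99 would receive the following confidence:

```python
import math

def extraction_confidence(word_label_probs):
    """Product of the predicted label probabilities of the extraction's words."""
    return math.prod(word_label_probs)

print(extraction_confidence([0.9, 0.95, 0.8, 0.99]))  # ≈ 0.677
```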

Supervised Open Information Extraction
Gabriel Stanovsky 2,3, Julian Michael 2, Luke Zettlemoyer 2, Ido Dagan 1
1 Bar-Ilan University   2 University of Washington   3 Allen Institute for Artificial Intelligence

github.com/gabrielStanovsky/supervised-oie

Training Data
We used the QA-SRL to Open IE conversion (OIE2016; Stanovsky and Dagan, 2016) to train our model. It consists of verbal propositions automatically extracted from template QA-SRL annotations.

Augmenting with QAMR annotations
In addition, we converted the Question-Answer Meaning Representation bank (Michael et al., 2018; come see our poster tomorrow!), which consists of free-form question-answer pairs over a wide range of predicates. The conversion was achieved with heuristics over the QA parse tree.
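The poster does not spell out these heuristics; the toy sketch below only illustrates the general shape of the mapping from a question-answer pair to an Open IE tuple for the simplest "Who <verb> <object>?" pattern, and is not the method used in the paper.

```python
def qa_to_tuple(question, answer):
    """Toy mapping from a QA pair to (arg0; predicate; arg1); illustrative only."""
    words = question.rstrip("?").split()
    if words and words[0].lower() == "who" and len(words) >= 3:
        predicate, arg1 = words[1], " ".join(words[2:])
        return (answer, predicate, arg1)
    return None  # the real heuristics cover many more question shapes via the QA parse tree

print(qa_to_tuple("Who earned the Pulitzer Prize?", "John Steinbeck"))
# ('John Steinbeck', 'earned', 'the Pulitzer Prize')
```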

Resulting Training Corpus

Dataset   Domain          #Sentences   #Tuples
OIE2016   News, Wiki      3200         5077
QAMR      Wikinews, Wiki  3300         12952

Evaluation
We compare RNN-OIE against top-performing Open IE systems:

RNN-OIE performs competitively across all test sets, outperforming all other systems on the larger test sets. QAMR improves performance, especially on more diverse test sets.

Run-time Analysis
RNN-OIE is able to leverage the GPU to achieve a roughly tenfold speedup over the previous fastest system (measured in sentences per second):

        ClausIE   PropS   Open IE4   RNN-OIE
CPU     4.07      4.59    15.38      13.51
GPU     ---       ---     ---        149.25



Test Data
We test our model on four publicly available Open IE corpora, following Schneider et al. (2017):


Dataset   Domain       #Sentences   #Tuples
OIE2016   News, Wiki   3200         1729
WEB       News, Web    500          461
NYT       News, Wiki   222          222
PENN      Mixed        100          51


Error Analysis
An analysis of 100 gold propositions missed by all systems (i.e., recall errors) reveals that all systems struggle with noun relations, sentence-level inference, and long or informal sentences.

Error categories: noun predicate, nominalization, sentence-level inference, noisy / informal text, long sentences, PP attachment.

Example missed propositions:
Andre Agassi did a similar thing in his hometown of Las Vegas. → (Andre Agassi; hometown; Las Vegas)
John Steinbeck also earned a lot of awards, one being the Pulitzer Prize. → (John Steinbeck; earned; Pulitzer Prize)

[Figure: Area under the PR curve on OIE2016, WEB, NYT, and PENN for ClausIE, PropS, Open IE4, RNN-OIE-verb, and RNN-OIE-QAMR.]