NYU
ANLP-001
Automatic Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber
Ralph Grishman
Pasi Tapanainen
Silja Huttunen
Outline
– Information Extraction: background
– Problems in IE
– Prior Work: Machine Learning for IE
– Discovering patterns from raw text
– Experimental results
– Current work
Quick Overview
What is Information Extraction? Definition:
– finding facts about a specified class of events in free text
– filling a table in a database (slots in a template)
Events: instances of relations, with many arguments
Example: Management Succession
– George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
Example: Management Succession

Position    Company                               Location  Person          Status
President   European Information Services, Inc.   London    George Garrick  Out
CEO         Nielsen Marketing Research            USA       George Garrick  In

– George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
System Architecture: Proteus

[Pipeline diagram] Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns (sentence level) → Reference Resolution → Discourse Analyzer (discourse level) → Output Generation → Extracted Information
Problems
– Customization
– Performance
Problems: Customization
To customize a system for a new extraction task, we have to develop:
– new patterns for new types of events
– word classes for the domain
– inference rules
This can be a large job requiring skilled labor; the expense of customization limits the uses of extraction.
Problems: Performance
Performance on event IE is limited. On MUC tasks, typical top performance is recall < 55%, precision < 75%.
Errors propagate through multiple phases:
– name recognition errors
– syntax analysis errors
– missing patterns
– reference resolution errors
– complex inference required
Missing Patterns
As with many language phenomena, there are:
– a few common patterns
– a large number of rare patterns
Rare patterns do not surface sufficiently often in a limited corpus.
Missing patterns make customization expensive and limit performance.
Finding good patterns is necessary to improve customization and performance.
[Figure: Zipf-like frequency vs. rank curve for patterns]
Prior Research
– build patterns from examples: Yangarber ’97
– generalize from multiple examples in annotated text: Crystal, Whisk (Soderland), Rapier (Califf)
– active learning to reduce annotation: Soderland ’99, Califf ’99
– learning from a corpus with relevance judgements: Riloff ’96, ’99
– co-learning/bootstrapping: Brin ’98, Agichtein ’00
Our Goals
Minimize the manual labor required to construct pattern bases for a new domain:
– un-annotated text
– un-classified text
– un-supervised learning
Use very large corpora -- larger than we could ever tag manually -- to improve coverage of patterns.
Principle: Pattern Density
If we have relevance judgements for the documents in a corpus, for the given task, then the patterns which are much more frequent in relevant documents will generally be good patterns.
Riloff (1996) finds patterns related to terrorist attacks.
Principle: Duality
Duality between patterns and documents:
– relevant documents are strong indicators of good patterns
– good patterns are strong indicators of relevant documents
Outline of Procedure
– Initial query: a small set of seed patterns which partially characterize the topic of interest
Repeat:
– Retrieve documents containing the current patterns: “relevant documents”
– Rank patterns in relevant documents by frequency in relevant docs vs. overall frequency
– Add the top-ranked pattern to the seed pattern set
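The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: documents are represented simply as sets of candidate patterns, the minimum-support cutoff of 2 is an assumed knob, and the score is a Riloff-96-style density-times-log-frequency metric of the kind the slides cite.

```python
import math

def discover_patterns(corpus, seeds, iterations=10):
    """corpus: list of sets; each set holds the candidate patterns of one document.
    Grows the pattern set from the seeds by iterating retrieve -> rank -> add."""
    patterns = set(seeds)
    for _ in range(iterations):
        # Documents matching any accepted pattern count as relevant.
        relevant = [doc for doc in corpus if patterns & doc]
        best, best_score = None, 0.0
        for p in {q for doc in corpus for q in doc} - patterns:
            in_rel = sum(p in doc for doc in relevant)
            overall = sum(p in doc for doc in corpus)
            if in_rel < 2:   # require some support before trusting a pattern
                continue
            # Density in relevant docs, weighted by log frequency there.
            score = (in_rel / overall) * math.log(in_rel)
            if score > best_score:
                best, best_score = p, score
        if best is None:
            break
        patterns.add(best)   # accept the top-ranked pattern
    return patterns

# Toy corpus: two retirement stories, one succession story, one irrelevant story.
corpus = [
    {"retires", "named president"},
    {"retires", "named president"},
    {"named president", "posts profit"},
    {"posts profit", "stock falls"},
]
```

Starting from the single seed "retires", the loop discovers "named president" because it co-occurs with the seed in the relevant documents, then stops when no remaining candidate has enough support.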
#1: pick seed pattern
Seed: < person retires >
#2: retrieve relevant documents
Seed: < person retires >
Relevant documents:
– Fred retired.... Harry was named president.
– Maki retired.... Yuki was named president.
Other documents: (no match for the seed)
#3: pick new pattern
Seed: < person retires >
< person was named president > appears in several relevant documents (top-ranked by the Riloff metric)
– Fred retired.... Harry was named president.
– Maki retired.... Yuki was named president.
#4: add new pattern to pattern set
Pattern set:
< person retires >
< person was named president >
Pre-processing
For each document, find and classify names: { person | location | organization | … }
Parse the document (regularizing passives, relative clauses, etc.)
For each clause, collect a candidate pattern: a tuple of the heads of
[ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
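The tuple-collection step can be sketched as follows. This assumes the parser has already regularized the clause and classified the names; the dict-based clause representation and field names are illustrative, not the system's actual data structures.

```python
def candidate_pattern(clause):
    """Reduce a parsed clause to a tuple of head words.
    Name heads are assumed to have been replaced by their semantic
    class (person, organization, ...) during name classification."""
    return (
        clause.get("subject"),
        clause.get("verb"),
        clause.get("object"),
        clause.get("complement"),
    )

# "Nielsen Marketing Research appointed George Garrick CEO" after
# name classification and parsing might yield:
clause = {
    "subject": "company",
    "verb": "appoint",
    "object": "person",
    "complement": "officer",
}
```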
Experiment
Task: Management succession (as in MUC-6)
Source: Wall Street Journal
Training corpus: ~6,000 articles
Test corpus:
– 100 documents: MUC-6 formal training
– + 150 documents judged manually
Experiment: two seed patterns

Subject   Verb       Object
company   v-appoint  person
person    v-resign   -

where v-appoint = { appoint, elect, promote, name } and v-resign = { resign, depart, quit, step-down }
Run the discovery procedure for 80 iterations.
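The two seeds can be written directly as data over the v-appoint and v-resign word classes. The matcher below is a toy stand-in for the system's pattern matching; "-" (no object) is modeled here as None.

```python
# Verb classes from the slide.
V_APPOINT = {"appoint", "elect", "promote", "name"}
V_RESIGN = {"resign", "depart", "quit", "step-down"}

# Seed patterns as (subject class, verb class, object class-or-None).
SEEDS = [
    ("company", V_APPOINT, "person"),
    ("person", V_RESIGN, None),
]

def matches(seed, tup):
    """True if a (subject, verb, object) clause tuple instantiates the seed."""
    subj, verbs, obj = seed
    return tup[0] == subj and tup[1] in verbs and (obj is None or tup[2] == obj)
```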
Evaluation
– Look at discovered patterns: new patterns missed in manual training
– Document filtering
– Slot filling
Discovered patterns

Subject   Verb                Object
company   v-appoint           person
person    v-resign            -
person    succeed             person
person    be | become         president | officer | chairman | executive
company   name                president | …
person    join | run | leave  company
person    serve               board | company
person    leave               post
Evaluation: new patterns
Not found in manual training:

Subject   Verb                      Object    Complements
company   bring                     person    [as+officer]
person    come | return             -         [to+company] [as+officer]
person    rejoin                    company   [as+officer]
person    continue | remain | stay  -         [as+officer]
person    replace                   person    [as+officer]
person    pursue                    interest  -
Evaluation: Text Filtering
How effective are the discovered patterns at selecting relevant documents?
– IR-style evaluation
– a document is selected if it matches at least one pattern

Pattern set        Recall  Precision
Seed               11%     93%
Seed + discovered  88%     81% (85)
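The IR-style measures in the table are the usual set-overlap definitions. A small sketch, with hypothetical document ids:

```python
def filtering_scores(retrieved, relevant):
    """Recall and precision for document filtering, as sets of document ids."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

# Toy example: the filter retrieves 4 docs, 3 of which are truly relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
recall, precision = filtering_scores(retrieved, relevant)
```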
[Figure: 250 test documents (relevance threshold 0.5) — recall and precision plotted against generation number, 0 to 80.]
[Figure: Choice of test corpus (threshold 0.5) — precision vs. recall curves for the 250-document corpus, the 100-document MUC corpus, and the MUC-6 players.]
Evaluation: Slot filling
How effective are the patterns within a complete IE system?
MUC-style IE on the MUC-6 corpora
Caveat

                training                   test
pattern set     recall  precision  F      recall  precision  F
seed            28      78         41
+ discovered    51      76         61
manual–MUC      54      71         62     47      70         56.40
manual–now      69      79         74     54.7    74.4       63.02
Conclusion: Automatic discovery
– Performance comparable to human (4-week development)
– Works from un-annotated text: allows us to take advantage of very large corpora (redundancy, duality)
– Will likely help wider use of IE
Good Patterns
U = universe of all documents
R = set of relevant documents
H = H(p) = set of documents where pattern p matched
Density criterion: a pattern p is good if its density among relevant documents, |H ∩ R| / |H|, is substantially greater than the base rate of relevance, |R| / |U|.
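The density criterion can be checked with plain set arithmetic. The factor alpha below is an assumed knob standing in for "substantially greater"; the paper may use a different formulation.

```python
def satisfies_density(H, R, U, alpha=2.0):
    """Density criterion sketch: pattern p (matching document set H) is
    promising if its density among relevant documents R exceeds the base
    rate of relevance in the universe U by a factor alpha (assumed knob)."""
    return len(H & R) / len(H) > alpha * len(R) / len(U)

U = set(range(100))      # 100 documents in total
R = set(range(10))       # 10% of them are relevant
H = {0, 1, 2, 3, 50}     # pattern matches 5 docs, 4 of them relevant
```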
Graded Relevance
Documents matching the seed patterns are considered 100% relevant. Discovered patterns are considered less certain; documents containing them are considered only partially relevant.
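One way to realize graded relevance is to carry a confidence in [0, 1] on each accepted pattern and let a document inherit the best confidence among the patterns it contains. The max-combination below is an illustrative assumption, not necessarily the paper's exact scheme; seeds get confidence 1.0.

```python
def doc_relevance(doc_patterns, confidences):
    """Partial relevance of a document: the highest confidence of any
    accepted pattern it contains (0.0 if it contains none)."""
    return max((confidences.get(p, 0.0) for p in doc_patterns), default=0.0)

# Hypothetical confidences: the seed is certain, a discovered pattern less so.
conf = {"retires": 1.0, "named president": 0.7}
```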
Scoring Patterns
Score(p) = (document frequency in relevant documents / overall document frequency) × log(document frequency in relevant documents)
– (metrics similar to those used in Riloff-96)