Artificial Intelligence Research Centre
Program Systems Institute
Russian Academy of Science
152020 Pereslavl-Zalessky
Russia
INEX: Tools for Information Extraction
+7 48535 [email protected]
AiReC
Information extraction
Objective: extract meaningful information of a pre-specified type from (typically large amounts of) texts for further analytical purposes
Output: data structures of a pre-specified format (filled scenario templates)
Examples
Sports report: <winner>, <loser>, <score>, <location>, <date>…
Database on rental accommodation opportunities: <location>, <renting price>, <bedrooms number>, <phone number>…
Possible IE application scenarios:
inference of new information (knowledge acquisition)
query formulation and answering in human-computer systems
automatic generation of abstracts and summaries
visualization of document content, etc.
The 'Newsmaking' task
<newsmaker>
<type of newsmaker> (person or organization)
<message>
<type of message> (original, cited, a reference to another newsmaker)
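As an illustration only, the four slots of the 'Newsmaking' template could be held in a small data structure like the one below. The class and field names are hypothetical (they are not taken from INEX), and the sample values are invented.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


class NewsmakerType(Enum):
    PERSON = "person"
    ORGANIZATION = "organization"


class MessageType(Enum):
    ORIGINAL = "original"
    CITED = "cited"
    REFERENCE = "reference to another newsmaker"


@dataclass
class NewsmakingTemplate:
    # Every slot is optional: a template may be only partially filled.
    newsmaker: Optional[str] = None
    newsmaker_type: Optional[NewsmakerType] = None
    message: Optional[str] = None
    message_type: Optional[MessageType] = None

    def is_complete(self) -> bool:
        # A template is complete when no slot is left empty.
        return all(getattr(self, f.name) is not None for f in fields(self))


# Invented sample values, for illustration only.
example = NewsmakingTemplate(
    newsmaker="Example Bank",
    newsmaker_type=NewsmakerType.ORGANIZATION,
    message="Interest rates will stay unchanged.",
    message_type=MessageType.ORIGINAL,
)
partial = NewsmakingTemplate(newsmaker="Example Bank")
```

Keeping every slot optional reflects the fact that early processing stages usually fill only some of the slots.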
IE system architecture
[Figure: IE system architecture. Blocks: Collection of texts; Linguistic processor; Semantic analyser; Extraction of task-specific information; Results. Processing stages: Tokenisation & sentence segmentation; Morphological analysis; Filtering; Disambiguation (partial); Microsyntactic analysis; Macrosyntactic analysis; Named entity recognition; Applying information extraction rules; Coreference resolution; Merging partial results. Data and resources: Input text; Linguistic information; Information extraction rules; Coreference resolution rules.]
Tokenisation & sentence segmentation
Tokenisation: identification of words, punctuation marks, delimiters, special characters
Sentence segmentation: recognizing sentence boundaries
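A minimal sketch of both steps using regular expressions is given below. The patterns and function names are assumptions made for illustration and do not describe how INEX works; in particular, the sentence splitter would wrongly split after abbreviations such as "Dr.".

```python
import re

# A token is a word (optionally with internal hyphens or apostrophes)
# or a single non-space, non-word character such as a punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# Naive boundary: sentence-final punctuation, whitespace, then a capital letter.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def tokenise(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at the naive boundaries defined above."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]
```

For example, `tokenise("Hello, world!")` yields the four tokens `Hello`, `,`, `world` and `!`.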
Morphological analysis
maps every word-form of the input text to (a) canonical form(s)
recognizes the word's morphological properties
Results are typically ambiguous.
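The ambiguity can be illustrated with a toy dictionary lookup. The lexicon, tag names and function below are hypothetical; a real analyser would rely on a full morphological dictionary or rule set.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str     # canonical form
    pos: str       # part of speech
    features: str  # other morphological properties


# Toy lexicon, invented for illustration: one word-form, several analyses.
LEXICON = {
    "leaves": [
        Analysis("leave", "VERB", "present, 3rd person singular"),
        Analysis("leaf", "NOUN", "plural"),
    ],
    "the": [Analysis("the", "DET", "")],
}


def analyse(word_form: str) -> list[Analysis]:
    """Return every analysis known for a word-form (possibly more than one)."""
    key = word_form.lower()
    return LEXICON.get(key, [Analysis(key, "UNKNOWN", "")])
```

Here `analyse("leaves")` returns two candidate analyses, and choosing between them is left to a later disambiguation step.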
Filtering
reduces the text to be subjected to further processing to potentially relevant portions
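One simple way to filter, shown here only as a sketch, is to keep sentences that contain at least one trigger word for the task. The trigger-word approach and the function name are assumptions; the slides do not say how filtering is implemented.

```python
import re


def filter_sentences(sentences: list[str], trigger_words: list[str]) -> list[str]:
    """Keep only sentences containing at least one trigger word."""
    triggers = {w.lower() for w in trigger_words}
    kept = []
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if words & triggers:  # non-empty intersection: potentially relevant
            kept.append(sentence)
    return kept
```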
Disambiguation
performed either as a side effect of other processes (e.g., microsyntactic analysis)
or as a stand-alone stage
Microsyntactic analysis
identifies noun phrases (NP)
identifies some regularly formed constructions (numbers, dates, personal proper names)
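Regularly formed constructions such as dates and numbers lend themselves to pattern matching. The sketch below is an assumption-laden illustration (English month names, day-month-year order, hypothetical names) rather than the method used in INEX.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# The date alternative comes first so that "12 May 2003" is recognised
# as one date instead of two separate numbers.
CONSTRUCTION_RE = re.compile(
    rf"(?P<date>\b\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}}\b)"
    r"|(?P<number>\b\d+(?:\.\d+)?\b)"
)


def find_constructions(text: str) -> list[tuple[str, str]]:
    """Return (construction type, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in CONSTRUCTION_RE.finditer(text)]
```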
Macrosyntactic analysis
identifies clause boundaries
constructs clause hierarchy within a sentence
Named entity recognizer
identifies proper names
assigns semantic features to certain items
Information extraction rules
a domain knowledge representation formalism (scenario templates)
a set of patterns to identify template elements in a text (covering the many possible ways to talk about the target event elements)
IE pattern includes:
a set of rules that define how to retrieve this pattern in a text
a set of constraints imposed on textual elements to fit into a particular slot of the target template
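The two parts of a pattern can be sketched for the sports-report example: a rule that locates candidate slot fillers, and a constraint that a filler must satisfy before the slot is accepted. The regular expression, the constraint and all names below are hypothetical, and the sample sentences in the usage note are invented.

```python
import re
from typing import Optional

# Rule: how to find the pattern in a sentence ("X beat Y 2-1").
RESULT_RULE = re.compile(
    r"(?P<winner>[A-Z]\w+) (?:beat|defeated) (?P<loser>[A-Z]\w+) (?P<score>\d+-\d+)"
)


def score_is_consistent(score: str) -> bool:
    """Constraint: the winner's score must exceed the loser's."""
    won, lost = (int(part) for part in score.split("-"))
    return won > lost


def apply_pattern(sentence: str) -> Optional[dict[str, str]]:
    """Return filled slots if the rule matches and the constraint holds."""
    match = RESULT_RULE.search(sentence)
    if match is None:
        return None
    slots = match.groupdict()
    if not score_is_consistent(slots["score"]):
        return None
    return slots
```

With these definitions, `apply_pattern("Reds beat Blues 2-1 in the final.")` fills the winner, loser and score slots, while a sentence reporting a 1-2 score for the supposed winner is rejected by the constraint. A single rule like this covers only one way of reporting a result; in practice many such patterns are needed.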
Coreference Resolver
recognizes different occurrences of the same entity in a text
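As a deliberately naive sketch, mentions can be grouped when the words of one mention are contained in another (for instance a full name and a bare surname). This heuristic and the function name are assumptions; real coreference resolution also handles pronouns, synonyms and rule-based evidence. The sample names used in the usage note are invented.

```python
def group_mentions(mentions: list[str]) -> list[list[str]]:
    """Group mentions whose word sets are contained in one another."""
    entities: list[list[str]] = []
    for mention in mentions:
        tokens = set(mention.lower().split())
        for entity in entities:
            # Join an existing entity if this mention overlaps fully
            # with any mention already assigned to it.
            if any(
                tokens <= set(m.lower().split()) or set(m.lower().split()) <= tokens
                for m in entity
            ):
                entity.append(mention)
                break
        else:
            entities.append([mention])
    return entities
```

Given the mentions "Ivan Petrov", "Acme Bank", "Petrov" and "Acme", the function groups the first with the third and the second with the fourth.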
Merging partial results
merging partially filled templates to produce a final, maximally filled template
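Merging can be sketched as a slot-by-slot union of two partial templates that succeeds only when no slot holds conflicting values. Representing templates as dictionaries and refusing to merge on conflict are assumptions made for this example.

```python
from typing import Optional

Template = dict[str, Optional[str]]


def merge_templates(a: Template, b: Template) -> Optional[Template]:
    """Merge two partial templates; return None if any slot conflicts."""
    merged: Template = dict(a)
    for slot, value in b.items():
        if value is None:
            continue  # nothing to contribute for this slot
        if merged.get(slot) is None:
            merged[slot] = value  # fill an empty or missing slot
        elif merged[slot] != value:
            return None  # conflicting fillers: likely different events
    return merged
```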