Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science 152020 Pereslavl-Zalessky Russia

INEX: Tools for Information Extraction

Artificial Intelligence Research Centre
Program Systems Institute
Russian Academy of Science
152020 Pereslavl-Zalessky, Russia
+7 48535 [email protected]

AiReC

Information extraction

Objective: extract meaningful information of a pre-specified type from (typically large amounts of) texts for further analytical purposes

Output: data structures of a pre-specified format (filled scenario templates)

Examples

Sports report: <winner>, <loser>, <score>, <location>, <date>…

Database on rental accommodation opportunities: <location>, <renting price>, <bedrooms number>, <phone number>…
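
A scenario template of this kind can be pictured as a record with one field per slot. The sketch below is illustrative only: the class name `SportsReport`, the helper `filled_slots`, and the sample values are hypothetical and not part of INEX; only the slot names come from the sports-report example above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SportsReport:
    """Hypothetical scenario template; every slot starts out unfilled."""
    winner: Optional[str] = None
    loser: Optional[str] = None
    score: Optional[str] = None
    location: Optional[str] = None
    date: Optional[str] = None


def filled_slots(template) -> dict:
    """Return only the slots that have been filled so far."""
    return {
        f.name: getattr(template, f.name)
        for f in fields(template)
        if getattr(template, f.name) is not None
    }


# Placeholder values, not taken from any real report.
report = SportsReport(winner="Team A", loser="Team B", score="2:1")
print(filled_slots(report))
```

Leaving unfilled slots as `None` makes a partially filled template easy to tell apart from a complete one.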

Possible IE application scenarios:

inference of new information (knowledge acquisition)

query formulation and answering in human-computer systems

automatic generation of abstracts and summaries

visualization of document content, etc.

The 'Newsmaking' task

<newsmaker>

<type of newsmaker> (person or organization)

<message>

<type of message> (original, cited, a reference to another newsmaker)
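
Two of the slots in this task take values from a small closed set, which could be modelled with enumerations. This is only a sketch under that assumption: the names `NewsmakerType`, `MessageType`, and `NewsmakingTemplate` are hypothetical, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class NewsmakerType(Enum):
    PERSON = "person"
    ORGANIZATION = "organization"


class MessageType(Enum):
    ORIGINAL = "original"
    CITED = "cited"
    REFERENCE = "a reference to another newsmaker"


@dataclass
class NewsmakingTemplate:
    """Hypothetical record mirroring the four slots of the task."""
    newsmaker: str
    newsmaker_type: NewsmakerType
    message: str
    message_type: MessageType


# Placeholder values for illustration only.
example = NewsmakingTemplate(
    newsmaker="The ministry",
    newsmaker_type=NewsmakerType.ORGANIZATION,
    message="the budget has been approved",
    message_type=MessageType.ORIGINAL,
)
print(example.newsmaker_type.value, "/", example.message_type.value)
```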

IE system architecture

[Figure: IE system architecture diagram. Block labels: linguistic processor; semantic analyser; extraction of task-specific information. Stages shown: tokenisation & sentence segmentation; morphological analysis; filtering; disambiguation (partial); microsyntactic analysis; macrosyntactic analysis; named entity recognition; applying information extraction rules; coreference resolution; merging partial results. Data shown: collection of texts; input text; linguistic information; information extraction rules; coreference resolution rules; results.]

Tokenisation & sentence segmentation

Tokenisation: identification of words, punctuation marks, delimiters, special characters

Sentence segmentation: recognizing sentence boundaries
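
Both steps can be approximated with regular expressions. The following is a minimal sketch, not the INEX implementation: the function names and the two patterns are assumptions, and the segmentation rule is deliberately naive (it would wrongly split after abbreviations such as "e.g.").

```python
import re

# A token is either a run of word characters or a single
# non-space, non-word character (punctuation, delimiter, special character).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences at the assumed boundary pattern."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


def tokenise(sentence):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_RE.findall(sentence)


for sentence in split_sentences("The match ended. Who won?"):
    print(tokenise(sentence))
```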

Morphological analysis

maps every word-form of the input text to (a) canonical form(s)

recognizes the word's morphological properties

Results are typically ambiguous.
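
The ambiguity is easy to see with a dictionary-based analyser that returns every reading of a word-form. The sketch below assumes a tiny hand-made English lexicon; `Analysis`, `LEXICON`, and `analyse` are hypothetical names, and a real analyser would rely on a full morphological dictionary.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str      # canonical form
    pos: str        # part of speech
    features: str   # other morphological properties


# Toy lexicon (an assumption for illustration): word-form -> all readings.
LEXICON = {
    "saw": [
        Analysis("see", "VERB", "past"),
        Analysis("saw", "NOUN", "singular"),
    ],
    "reports": [
        Analysis("report", "NOUN", "plural"),
        Analysis("report", "VERB", "present, 3rd person singular"),
    ],
}


def analyse(word_form):
    """Return every known reading; unknown words get a single fallback."""
    key = word_form.lower()
    return LEXICON.get(key, [Analysis(key, "UNKNOWN", "")])


for reading in analyse("saw"):
    print(reading)
```

Returning a list rather than a single reading leaves the choice to a later disambiguation step.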

Filtering

reduces the text passed on to further processing to its potentially relevant portions
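
One simple way to filter is to keep only sentences that contain a trigger word for the task. This is a sketch under that assumption; the slides do not say how INEX filters, and `filter_sentences` and the trigger list are hypothetical.

```python
def filter_sentences(sentences, trigger_words):
    """Keep sentences that contain at least one trigger word."""
    triggers = {w.lower() for w in trigger_words}
    kept = []
    for sentence in sentences:
        # Strip surrounding punctuation so "said." still matches "said".
        words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
        if words & triggers:
            kept.append(sentence)
    return kept


sentences = [
    "The minister said the plan was ready.",
    "It rained all day.",
]
print(filter_sentences(sentences, ["said", "announced"]))
```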

Disambiguation

performed either as a side effect of other processes (e.g., microsyntactic analysis)

or as a stand-alone stage

Microsyntactic analysis

identifies noun phrases (NP)

identifies some regularly formed constructions (numbers, dates, personal proper names)
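
Regularly formed constructions lend themselves to pattern matching. The sketch below covers only numbers and dates, and it assumes a `DD.MM.YYYY` date format; both patterns and the name `find_constructions` are hypothetical. Noun phrases and personal names would need more than regular expressions.

```python
import re

# Assumed formats: dates as DD.MM.YYYY, numbers as integers or decimals.
DATE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def find_constructions(text):
    """Return (kind, text) pairs in order of appearance."""
    found = []
    date_spans = []
    for m in DATE_RE.finditer(text):
        found.append((m.start(), "DATE", m.group()))
        date_spans.append(m.span())
    for m in NUMBER_RE.finditer(text):
        # Skip digits that are already part of a recognised date.
        if not any(start <= m.start() < end for start, end in date_spans):
            found.append((m.start(), "NUMBER", m.group()))
    return [(kind, value) for _, kind, value in sorted(found)]


print(find_constructions("The flat costs 450 per month from 01.09.2004."))
```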

Macrosyntactic analysis

identifies clause boundaries

constructs clause hierarchy within a sentence

Named entity recognizer

identifies proper names

assigns semantic features to certain items
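
A gazetteer lookup is one common way to do both things at once: find a known name and attach a semantic label to it. This is a sketch under that assumption, not a description of the INEX recognizer; `GAZETTEER`, `tag_entities`, and the labels are hypothetical.

```python
# Toy gazetteer (an assumption): lower-cased name -> semantic label.
GAZETTEER = {
    "pereslavl-zalessky": "LOCATION",
    "russian academy of science": "ORGANIZATION",
}


def tag_entities(text):
    """Return (name as written, label) for the first hit of each entry."""
    lowered = text.lower()
    hits = []
    for name, label in GAZETTEER.items():
        start = lowered.find(name)
        if start != -1:
            hits.append((text[start:start + len(name)], label))
    return hits


print(tag_entities("The institute is located in Pereslavl-Zalessky."))
```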

Information extraction rules

a domain knowledge representation formalism (scenario templates)

a set of patterns to identify template elements in a text (covering the many possible ways to talk about the target event elements)

An IE pattern includes:

a set of rules that define how to retrieve this pattern in a text

a set of constraints imposed on textual elements to fit into a particular slot of the target
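
The two parts of a pattern can be kept separate in code: a rule that locates candidate elements, and per-slot constraints that decide whether a candidate may fill its slot. The sketch below assumes a single English phrasing for the rental example; `RENT_RULE`, `CONSTRAINTS`, and `apply_pattern` are hypothetical, and the actual INEX rule formalism is not shown in the slides.

```python
import re

# Rule: how to retrieve the pattern in a text (one assumed phrasing).
RENT_RULE = re.compile(
    r"(?P<bedrooms>\d+)-bedroom flat in (?P<location>[A-Z][\w-]+) "
    r"for (?P<price>\d+)"
)

# Constraints: what a textual element must satisfy to fit its slot.
CONSTRAINTS = {
    "bedrooms": lambda v: 1 <= int(v) <= 10,
    "location": lambda v: v[0].isupper(),
    "price": lambda v: int(v) > 0,
}


def apply_pattern(sentence):
    """Return the slots whose candidates pass their constraints."""
    match = RENT_RULE.search(sentence)
    if match is None:
        return {}
    return {
        slot: value
        for slot, value in match.groupdict().items()
        if CONSTRAINTS[slot](value)
    }


print(apply_pattern("A 2-bedroom flat in Yaroslavl for 300 a month."))
```

A candidate that fails its constraint is dropped, so the result may be a partially filled template.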

Coreference Resolver

recognizes different occurrences of the same entity in a text
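
As a rough illustration, mentions can be grouped into chains when one mention's words are contained in another's. This word-overlap heuristic is an assumption chosen for brevity, not the INEX coreference rules, and it ignores pronouns entirely; `resolve_coreference` is a hypothetical name.

```python
def resolve_coreference(mentions):
    """Group mentions into chains, one chain per assumed entity."""
    chains = []
    for mention in mentions:
        words = set(mention.lower().split())
        for chain in chains:
            # Compare against the first mention of the chain.
            head_words = set(chain[0].lower().split())
            if words <= head_words or head_words <= words:
                chain.append(mention)
                break
        else:
            # No existing chain matched: start a new entity.
            chains.append([mention])
    return chains


print(resolve_coreference([
    "Program Systems Institute",
    "Institute",
    "Russian Academy of Science",
    "Academy",
]))
```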

Merging partial results

merging partially filled templates to produce a final, maximally filled template
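
Merging can be sketched as filling each slot from the first partial template that has a value for it. The conflict policy here (the first value wins) is an assumption, since the slides do not say how conflicts are handled; `merge_templates` and the sample values are hypothetical.

```python
def merge_templates(partials):
    """Merge partially filled templates (dicts) into one template."""
    merged = {}
    for partial in partials:
        for slot, value in partial.items():
            # Fill a slot only if it is still empty (assumed policy).
            if value is not None and merged.get(slot) is None:
                merged[slot] = value
    return merged


partials = [
    {"winner": "Team A", "score": None},
    {"score": "2:1", "winner": "Team B"},
]
print(merge_templates(partials))
```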