
CMSC724: Information Extraction

Amol Deshpande

University of Maryland, College Park

April 18, 2013


(from Agichtein and Sarawagi, KDD 2006 tutorial: Scalable Information Extraction and Integration)

Example: Answering Queries Over Text

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE

Name               Title     Organization
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   Founder   Free Soft..

SELECT Name
FROM PEOPLE
WHERE Organization = 'Microsoft'

Result:
Bill Gates
Bill Veghte

(from William Cohen’s IE tutorial, 2003)
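To make the "query over extracted text" step concrete, here is a minimal sketch using Python's built-in sqlite3 module; the PEOPLE table and the query follow the slide, while the in-memory database and the expanded "Free Software Foundation" value are illustrative:

import sqlite3

# Load the extracted tuples into an in-memory table and run the slide's query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
conn.executemany(
    "INSERT INTO PEOPLE VALUES (?, ?, ?)",
    [("Bill Gates", "CEO", "Microsoft"),
     ("Bill Veghte", "VP", "Microsoft"),
     ("Richard Stallman", "Founder", "Free Software Foundation")],
)
rows = conn.execute(
    "SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'"
).fetchall()
print([name for (name,) in rows])   # ['Bill Gates', 'Bill Veghte']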

Overview

- Goal: automatically extract structured information from unstructured text
- Applications: news tracking, customer care, data cleaning, classified ads, PIM, citation databases, opinion databases
- Evolution:
  - Early systems: rule-based, with manually coded rules
  - Then: automatically learning rules from examples
  - Statistical learning: generative models based on HMMs, conditional models based on maximum entropy, conditional random fields, and so on


Overview

- Types of structure extracted
  - Entities
    - Named entities: names of persons, locations, companies
    - Disease names, protein names, paper titles, journal names
  - Relationships: binary vs. multi-way
  - Adjectives describing entities
  - Structures: lists, tables, ontologies
- Types of unstructured sources
  - Granularity of extraction: records/sentences vs. paragraphs/documents
  - Heterogeneity
    - Machine-generated pages: extractors often called wrappers
    - Partially-structured sources
    - Open-ended sources


Overview

- Input resources that are often available/used
  - Structured databases like the ACM DL or DBLP
  - Labeled data
  - Preprocessing libraries: NLP tools
- Challenges
  - Accuracy: precision vs. recall (see the sketch below)
  - Efficiency
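Precision and recall for extraction are usually computed over sets of extracted vs. gold-annotated entities; a minimal sketch, where the helper and the toy data are illustrative rather than from the slides:

def precision_recall(extracted, gold):
    """Precision/recall of an extractor's output against gold annotations.
    Both arguments are collections of (document_id, entity_string) pairs."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correctly extracted entities
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Example: two of three extractions are correct; one gold entity is missed.
p, r = precision_recall(
    {("d1", "Bill Gates"), ("d1", "Bill Veghte"), ("d1", "Microsoft VP")},
    {("d1", "Bill Gates"), ("d1", "Bill Veghte"), ("d1", "Richard Stallman")},
)
print(round(p, 2), round(r, 2))   # 0.67 0.67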


Entity Extraction: Rule-based

- Very useful for simple extraction tasks, and widely used
  - "Big Data" may make them even more viable today
- Typical rule-based system:
  - A collection of rules
  - Policies dictating how to use them
- Basically pattern matching, with some context around it


Entity Extraction: Rule-based

[Fig. 2.1 (not reproduced): a subset of rules for identifying company names, paraphrased from the named entity recognizer in GATE.]

2.1.4 Rules for Multiple Entities

Some rules take the form of regular expressions with multiple slots, each representing a different entity, so that this rule results in the recognition of multiple entities simultaneously. These rules are more useful for record-oriented data. For example, the WHISK [195] rule-based system has been targeted for extraction from structured records such as medical records, equipment maintenance logs, and classified ads. This rule, rephrased from [195], extracts two entities, the number of bedrooms and the rent, from an apartment rental ad:

({Orthography type = Digit}):Bedrooms ({String = "BR"}) ({}*) ({String = "$"}) ({Orthography type = Number}):Price
  → Number of Bedrooms = :Bedrooms, Rent = :Price
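A rough Python rendering of this WHISK-style two-slot rule as a single regular expression; the orthographic token classes are approximated and the sample ad is made up:

import re

# Two capture groups play the roles of the :Bedrooms and :Price slots;
# the "({}*)" wildcard between them is approximated by a non-greedy ".*?".
rule = re.compile(r"(\d+)\s*BR\b.*?\$\s*([\d,]+)", re.IGNORECASE | re.DOTALL)

ad = "Spacious 2 BR apartment near campus, hardwood floors, $1,400 per month"
m = rule.search(ad)
if m:
    print({"Number of Bedrooms": m.group(1), "Rent": m.group(2)})
    # {'Number of Bedrooms': '2', 'Rent': '1,400'}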


(from Agichtein and Sarawagi, KDD 2006 tutorial: Scalable Information Extraction and Integration)

Hand Coded Rule Example: Conference Name

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g., "International Conference ..." or the conference name for workshops (e.g., "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
  my ($file,$pattern) = @_;

Entity Extraction: Rule-based

- Usually a very large number of rules
  - May lead to conflicts, overlaps, etc.
  - Often rules depend on each other (application of one rule enables another rule)
- Policies:
  - Specify how to resolve conflicts (largest match, etc.; see the sketch below)
  - Order the rules
  - Encode the rules in a finite state machine
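For instance, a "largest match" policy can be sketched as follows: among overlapping candidate matches proposed by different rules, keep the longest span and drop anything it overlaps. This is purely illustrative; real systems encode such policies more carefully (e.g., in a finite state machine):

def largest_match(candidates):
    """candidates: list of (start, end, label) spans proposed by the rules.
    Considers the longest spans first and drops any span that overlaps one
    already kept."""
    kept = []
    for start, end, label in sorted(candidates, key=lambda s: s[1] - s[0], reverse=True):
        if all(end <= ks or start >= ke for ks, ke, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# "New York" (span 0-8) wins over the shorter overlapping match "York" (4-8).
print(largest_match([(0, 8, "Location"), (4, 8, "Location"), (10, 15, "Person")]))
# [(0, 8, 'Location'), (10, 15, 'Person')]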


Entity Extraction: Rule-based

- How to learn rules?
  - Specified by a domain expert
  - Learned from a training dataset

... a set of rules R1, ..., Rk such that the action part of each rule is one of the three action types described in Sections 2.1.2 through 2.1.4. The body of each rule R will match a fraction S(R) of the data segments in the N training documents. We call this fraction the coverage of R. Of all segments R covers, the action specified by R will be correct only for a subset S′(R) of them. The ratio of the sizes of S′(R) and S(R) is the precision of the rule. In rule learning, our goal is to cover all segments that contain an annotation by one or more rules and to ensure that the precision of each rule is high. Ultimately, the set of rules has to provide good recall and precision on new documents. Therefore, a trivial solution that covers each entity in D by its own very specific rule is useless even if this rule set has 100% coverage and precision.

To ensure generalizability, rule-learning algorithms attempt to define the smallest set of rules that cover the maximum number of training cases with high precision. However, finding such a size-optimal rule set is intractable. So, existing rule-learning algorithms follow a greedy hill-climbing strategy for learning one rule at a time under the following general framework:

(1) Rset = set of rules, initially empty.
(2) While there exists an entity x ∈ D not covered by any rule in Rset:
    (a) Form new rules around x.
    (b) Add new rules to Rset.
(3) Post-process rules to prune away redundant rules.

The main challenge in the above framework is figuring out how to create a new rule that has high overall coverage (and therefore generalizes), is nonredundant given rules already existing in Rset, and has high precision. Several strategies and heuristics have been proposed for this. They broadly fall under two classes: bottom-up [42, 43, 60] or top-down [170, 195]. In bottom-up learning a specific rule is generalized; in top-down learning a general rule is specialized. In practice, the details of rule-learning algorithms are much more involved, and only an outline of the main steps is presented here.

- Issues: how to create a new rule given the rules that already exist
- Different approaches, mostly heuristics (a schematic of the greedy covering loop follows)
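A schematic Python rendering of that greedy covering framework; the callables form_rules_around, is_redundant, and prune, and the rule.covers method, are placeholders for the bottom-up or top-down construction strategies mentioned above:

def learn_rules(training_entities, form_rules_around, is_redundant, prune):
    """Greedy hill-climbing rule learning (schematic).

    training_entities: annotated entities in the training documents D.
    form_rules_around(x, rset): proposes high-precision rules covering entity x.
    is_redundant(rule, rset): True if rset already covers what rule covers.
    prune(rset): post-processing step that drops redundant rules."""
    rset = []                                     # (1) Rset, initially empty
    covered = set()
    for x in training_entities:                   # (2) some entity x not yet covered
        if x in covered:
            continue
        for rule in form_rules_around(x, rset):   # (2a) form new rules around x
            if not is_redundant(rule, rset):
                rset.append(rule)                 # (2b) add new rules to Rset
                covered.update(rule.covers(training_entities))
    return prune(rset)                            # (3) prune redundant rules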


(from Agichtein and Sarawagi, KDD 2006 tutorial: Scalable Information Extraction and Integration)

Popular Machine Learning Methods for IE

- Naive Bayes
- SRV [Freitag, 1998], Inductive Logic Programming
- Rapier [Califf & Mooney, 1997]
- Hidden Markov Models [Leek, 1997]
- Maximum Entropy Markov Models [McCallum et al., 2000]
- Conditional Random Fields [Lafferty et al., 2001]
- Implementations available:
  - Mallet (Andrew McCallum)
  - crf.sourceforge.net (Sunita Sarawagi)
  - MinorThird: minorthird.sourceforge.net (William Cohen)

For details: [Feldman, 2006; Cohen, 2004]

Entity Extraction: Statistical Methods

- Token-based methods
  - Tokenize the sentences
  - For each token, try to assign it a label from a fixed set of labels Y

... method of handling multi-word entities is to treat extraction as a segmentation problem where each segment is an entity. We call these segment-level methods and discuss them in Section 3.2. Sometimes, decompositions based on tokens or segments fail to exploit the global structure in a source document. In such cases, context-free grammars driven by production rules are more effective. We discuss these in Section 3.3. We discuss algorithms for training and deploying these models in Sections 3.4 and 3.5, respectively.

We use the following notation in this section. We denote the given unstructured input as x and its tokens as x1 ... xn, where n is the number of tokens in the string. The set of entity types we want to extract from x is denoted as E.

3.1 Token-level Models

This is the most prevalent of statistical extraction methods on plain text data. The unstructured text is treated as a sequence of tokens and the extraction problem is to assign an entity label to each token. Figure 3.1 shows two example sequences, of eleven and nine words each. We denote the sequence of tokens as x = x1 ... xn. At the time of extraction each xi has to be classified into one of a set Y of labels. This gives rise to a tag sequence y = y1 ... yn. The set of labels Y comprises the set of entity types E and a special label "other" for tokens that do not belong to any of the entity types. For example, for segmenting an address record into its constituent ...

[Fig. 3.1 (not reproduced): tokenization of two sentences into sequences of tokens.]

Define a set of features (typically many) fi(y, x, i), with y ∈ Y, x ∈ X, e.g.:
  f1(y, x, i) = [[ xi equals "Fagin" ]] · [[ y = Author ]]
  f3(y, x, i) = [[ xi matches INITIAL_DOT ]] · [[ y = Author ]]
  f5(y, x, i) = [[ xi in Person_dictionary ]] · [[ y = Author ]]
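These indicator features translate almost directly into code; a minimal sketch, where the person dictionary and the INITIAL_DOT pattern are stand-ins:

import re

PERSON_DICTIONARY = {"Fagin", "Ullman", "Codd"}   # stand-in lexicon
INITIAL_DOT = re.compile(r"^[A-Z]\.$")            # an initial followed by a dot, e.g., "R."

def f1(y, x, i):
    return int(x[i] == "Fagin" and y == "Author")

def f3(y, x, i):
    return int(bool(INITIAL_DOT.match(x[i])) and y == "Author")

def f5(y, x, i):
    return int(x[i] in PERSON_DICTIONARY and y == "Author")

tokens = ["R.", "Fagin", "and", "J.", "Ullman"]
print(f1("Author", tokens, 1), f3("Author", tokens, 0), f5("Other", tokens, 1))  # 1 1 0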


Entity Extraction: Statistical Methods

- Token-based methods: assigning labels
  - Basic option: assign labels independently
    - Learn a classifier (e.g., an SVM) using the training data
    - Won't exploit any correlations across tokens
  - Left-to-right: assign labels going from left to right, using the label on the left to help predict the label of the next token (sketched below)
  - Conditional Random Fields (CRFs)
    - Widely used for this and other tasks
    - A special type of graphical model with tractable inference complexity
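The left-to-right option can be sketched as a greedy decoder in which the previously predicted label is fed back as a feature. The scoring function below is an arbitrary toy stand-in for a trained classifier; a CRF would instead score whole label sequences jointly:

def label_left_to_right(tokens, labels, score):
    """Greedy left-to-right tagging.
    score(token, prev_label, label) -> float would come from a trained classifier."""
    prev, output = "other", []
    for tok in tokens:
        best = max(labels, key=lambda y: score(tok, prev, y))
        output.append(best)
        prev = best                  # the label on the left informs the next decision
    return output

# Toy score: a capitalized token starts or continues a Person entity.
def toy_score(tok, prev, y):
    if y == "Person":
        return 1.0 if tok[:1].isupper() and prev in ("other", "Person") else -1.0
    return 0.0

print(label_left_to_right(["Bill", "Gates", "was", "against"], ["Person", "other"], toy_score))
# ['Person', 'Person', 'other', 'other']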


Entity Extraction: Statistical Methods

- Segment-based methods
  - Features defined over segments comprising multiple tokens
  - Segment-level features are hard to capture in the token-based methods (see the sketch below), e.g.:
    - f(yi, yi−1, x, 3, 5) = [[ x3 x4 x5 appears in a list of journals ]] · [[ yi = journal ]]
    - f(yi, yi−1, x, 3, 5) = max TF-IDF-similarity(x3 x4 x5, J) · [[ yi = journal ]]
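A sketch of the first segment-level feature in code; the journal list J and the example segment boundaries are placeholders:

JOURNALS = {"acm transactions on database systems", "the vldb journal"}   # stand-in for J

def segment_in_journal_list(y, x, start, end):
    """Indicator feature: the segment x[start..end] appears in a list of journals
    and is being labeled 'journal'; mirrors f(yi, yi-1, x, 3, 5) above."""
    segment = " ".join(x[start:end + 1]).lower()
    return int(segment in JOURNALS and y == "journal")

x = ["R.", "Fagin", ",", "The", "VLDB", "Journal", ",", "2001"]
print(segment_in_journal_list("journal", x, 3, 5))   # 1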


(from Agichtein and Sarawagi, KDD 2006 tutorial: Scalable Information Extraction and Integration)

Relation Extraction: Disease Outbreaks

- Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System (e.g., NYU's Proteus)

Disease Outbreaks in The New York Times:

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire
July 1995   Mad Cow Disease   U.K.

(from Agichtein and Sarawagi, KDD 2006 tutorial: Scalable Information Extraction and Integration)

Relation Extraction

- Typically requires entity tagging as preprocessing
- Knowledge engineering
  - Rules defined over lexical items: "<company> located in <location>" (see the sketch below)
  - Rules defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))"
  - Proteus, GATE, ...
- Machine learning-based
  - Learn rules/patterns from examples (Dan Roth 2005, Cardie 2006, Mooney 2005, ...)
  - Partially supervised: bootstrap from "seed" examples (Agichtein & Gravano 2000, Etzioni et al. 2004, ...)
  - Recently, hybrid models [Feldman 2004, 2006]
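A toy version of the lexical-pattern rule "<company> located in <location>", assuming entity tagging has already marked company and location spans in the text; the inline tag format and the sample sentence are assumptions made purely for illustration:

import re

# After entity tagging, the lexical pattern becomes a simple regular expression
# over the tagged text.
PATTERN = re.compile(
    r"<company>(?P<company>.+?)</company>\s+located in\s+<location>(?P<location>.+?)</location>"
)

tagged = "<company>Example Corp.</company> located in <location>College Park</location>"
m = PATTERN.search(tagged)
if m:
    print({"company": m.group("company"), "location": m.group("location")})
    # {'company': 'Example Corp.', 'location': 'College Park'}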