CMPT 884, SFU, Martin Ester, 1-09 160 Information Extraction Martin Ester Simon Fraser University School of Computing Science CMPT 884 Spring 2009


Information Extraction

Martin Ester

Simon Fraser University

School of Computing Science

CMPT 884

Spring 2009


Information Extraction

Outline

• Introduction
  motivation, applications, issues
• Entity extraction
  hand-coded, machine learning
• Relation extraction
  supervised, partially supervised
• Entity resolution
  string similarity, finding similar pairs, creating groups
• Future research

[Feldman 2006] [Agichtein & Sarawagi 2006]


Introduction

Motivation

• 80% of all human-generated data is natural language text

• search engines return whole documents, requiring the user to read documents and manually extract relevant information (entities, facts, . . .)
  → very time-consuming
• need for automatic extraction of such information from collections of natural language text documents
  → information extraction (IE)


Introduction

Definitions

• Entity: an object of interest such as a person or organization.
• Attribute: a property of an entity such as its name, alias, descriptor, or type.
• Relation: a relationship held between two or more entities such as Position of a Person in a Company.
• Event: an activity involving several entities such as a terrorist act, aircraft crash, management change, new product introduction.


Introduction

Example


Introduction

Applications

• question answering
  Who is the president of the US?
  Where was Martin Luther born?
• automatic creation of databases
  e.g., database of protein localizations or adverse reactions to a drug
• opinion mining
  analyzing online product reviews to get user feedback


Introduction

Challenges

• Complexity of natural language
  e.g., identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese / Japanese
• Ambiguity of natural language
  e.g., homonyms
• Diversity of natural language
  many ways of expressing the same information, e.g., synonyms
• Diversity of writing styles
  e.g., scientific papers, newspaper articles, maintenance reports, emails, . . .


Introduction

Challenges

• names are hard to discover
  – impossible to enumerate
  – new candidates are generated all the time
  – hard to provide syntactic rules
• types of proper names
  – people
  – companies
  – products
  – genes
  – . . .


Introduction

Architecture of IE System

[Figure: architecture of an IE system; recoverable component labels: "Local analysis" and "Discourse (global) analysis"]


Introduction

Knowledge Engineering Approach

• Extraction rules are hand-crafted by linguists in cooperation with domain experts.
• Most of the work is done by inspecting a set of relevant documents.
• Development of rule set is very time-consuming.
• Requires substantial CS and domain expertise.
• Rule sets are domain-specific, do not transfer to other domains.
• Knowledge engineering (KE) approach often achieves higher accuracy than machine learning approach.


Introduction

Machine Learning Approach

• Automatically learn a model ("rules") from an annotated training corpus.
• Techniques based on pure statistics and little linguistic knowledge.
• No CS expertise required when building the model.
• However, creating the annotated corpus is very laborious, since a very large number of training examples is needed.
• Transfer to other domains is easier than with the KE approach.
• Accuracy of machine learning (ML) approach is typically lower.


Introduction

Topics Not Covered

• co-reference resolution
  e.g., a pronoun or definite noun phrase referring to a noun (entity) of another sentence
• event extraction
  an event has a type, actor, time, . . .
• sentiment detection
  a statement (opinion) is classified as positive / negative


Entity Extraction

Lexical Analysis

• breaking up the input document into individual words = tokens
• token: sequence of characters treated as a unit
• punctuation marks are also considered tokens
  e.g., "," (comma)
• often, regular expressions are used to define the format of a token
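The regular-expression view of tokens can be sketched as follows. This is only an illustration: the pattern `TOKEN_RE` and the function name `tokenize` are assumptions, not part of the lecture, and a real tokenizer would need more cases (abbreviations, numbers, hyphenation).

```python
import re

# Hypothetical token format (an assumption for illustration):
# a run of word characters, or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Laura Haas works at IBM, in San Jose."))
```

Punctuation marks such as the comma and the period come out as tokens of their own, as described above.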


Entity Extraction

Syntactic Analysis

• part-of-speech tagging [Charniak 1997]
  marking up the tokens in a text as corresponding to a particular part of speech (POS), based on both their definition and their context
• coarse POS tags: e.g., N, V, A, Aux, …
• finer POS tags:
  - PRP: personal pronouns (you, me, she, he, them, him, her, …)
  - PRP$: possessive pronouns (my, our, her, his, …)
  - NN: singular common nouns (sky, door, theorem, …)
  - NNS: plural common nouns (doors, theorems, women, …)
  - NNP: singular proper names (Fifi, IBM, Canada, …)
  - NNPS: plural proper names (Americas, Carolinas, …)


Entity Extraction

Syntactic Analysis

• Words often have more than one POS, e.g. back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
  e.g., input:  the lead paint is unsafe
        output: the/Det lead/N paint/N is/V unsafe/Adj


Entity Extraction

Knowledge Engineering Approach [Chaudhuri 2005]

• hand-coded rules often relatively straightforward
• easy to incorporate domain knowledge
• require substantial CS expertise
• example rule
  <token>INITIAL</token>
  <token>DOT</token>
  <token>CAPSWORD</token>
  <token>CAPSWORD</token>
  finds person names with a salutation and two capitalized words, e.g. Dr. Laura Haas
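A rule of this kind can be imitated in a few lines. The sketch below is not the rule language of [Chaudhuri 2005]; the salutation list `SALUTATIONS`, the class definitions in `token_class`, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical salutation list; the lecture does not define INITIAL precisely.
SALUTATIONS = {"Dr", "Mr", "Mrs", "Ms", "Prof"}

# The rule as a sequence of token classes.
RULE = ["INITIAL", "DOT", "CAPSWORD", "CAPSWORD"]

def token_class(tok):
    """Map a token to a coarse class used by the rule."""
    if tok in SALUTATIONS:
        return "INITIAL"
    if tok == ".":
        return "DOT"
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return "CAPSWORD"
    return "OTHER"

def find_person_names(tokens):
    """Return every token window whose class sequence equals RULE."""
    classes = [token_class(t) for t in tokens]
    names = []
    for i in range(len(tokens) - len(RULE) + 1):
        if classes[i:i + len(RULE)] == RULE:
            names.append(f"{tokens[i]}. {tokens[i + 2]} {tokens[i + 3]}")
    return names

print(find_person_names(["Yesterday", "Dr", ".", "Laura", "Haas", "spoke", "."]))
```

Domain knowledge enters through the salutation list and the class definitions, which is also why such rules do not transfer easily to other domains.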


Entity Extraction

Knowledge Engineering Approach

• a more complex example: conference name

my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # a word starting with a capital letter, followed by 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # abbreviations like "(SIGMOD'06)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";
. . .


Entity Extraction

Machine Learning Approach

• We can view named entity extraction as a sequence classification problem: classify each word as belonging to one of the named entity classes or to the no-name class.
• The class label of a sequence element depends on the neighboring ones.
• One of the most popular techniques for classifying sequences is the Hidden Markov Model (HMM).
• Another popular ML method for entity extraction: Conditional Random Fields [Lafferty et al 2001].
• Requires a sufficiently large labeled (annotated) training dataset.


Entity Extraction

Hidden Markov Models [Rabiner 1989]

• An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions.
• The automaton models a probabilistic generative process.
• In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state, and repeating this transition/emission cycle until a designated final state is reached.
• Very successful in many sequence classification tasks.


Entity Extraction

Example

HMM for addresses


Entity Extraction

Hidden Markov Models

• $T$ = length of the sequence of observations (training set)
• $N$ = number of states in the model
• $q_t$ = the actual state at time $t$
• $S = \{S_1, \ldots, S_N\}$ (finite set of possible states)
• $V = \{O_1, \ldots, O_M\}$ (finite set of observation symbols)
• $\pi = \{\pi_i\} = \{P(q_1 = S_i)\}$ starting probabilities
• $A = \{a_{ij}\} = \{P(q_{t+1} = S_j \mid q_t = S_i)\}$ transition probabilities
• $B = \{b_i(O_t)\} = \{P(O_t \mid q_t = S_i)\}$ emission probabilities
• $\lambda = (\pi, A, B)$ hidden Markov model
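To make the notation concrete, the sketch below writes down a toy model λ = (π, A, B) and samples from it. All numbers, the state names, and the symbol classes are invented for illustration, and the process is stopped after a fixed length T instead of in a designated final state, which is a simplification.

```python
import random

# Toy model: every value below is an illustrative assumption.
states = ["Number", "Street", "City"]      # S
symbols = ["digits", "word", "comma"]      # V
pi = [0.8, 0.1, 0.1]                       # pi[i]   = P(q_1 = S_i)
A = [[0.1, 0.8, 0.1],                      # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
     [0.0, 0.6, 0.4],
     [0.0, 0.1, 0.9]]
B = [[0.9, 0.05, 0.05],                    # B[i][k] = P(O_t = symbols[k] | q_t = S_i)
     [0.1, 0.8, 0.1],
     [0.05, 0.75, 0.2]]

def sample(T, rng):
    """Generate T (state, symbol) steps from the toy HMM."""
    path, obs = [], []
    state = rng.choices(range(len(states)), weights=pi)[0]
    for _ in range(T):
        path.append(states[state])
        obs.append(rng.choices(symbols, weights=B[state])[0])
        state = rng.choices(range(len(states)), weights=A[state])[0]
    return path, obs

path, obs = sample(5, random.Random(0))
print(list(zip(path, obs)))
```

Each row of A and B is a probability distribution, so it sums to 1. Only the emitted symbols are observed; the state path is the hidden part.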


Entity Extraction

Hidden Markov Models

• How to find $P(O \mid \lambda)$, the probability of an observation sequence given the HMM?
  → forward-backward algorithm
• How to find the $\lambda$ that maximizes $P(O \mid \lambda)$?
  This is the task of the training phase.
  → Baum-Welch algorithm
• How to find the most likely state trajectory given $\lambda$ and $O$?
  This is the task of the test phase.
  → Viterbi algorithm
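Two of the three problems have short dynamic-programming solutions. The sketch below shows the forward recursion for P(O | λ) and the Viterbi recursion for the most likely state path. The two-state model and its numbers are illustrative assumptions; states and symbols are represented by integer indices. Baum-Welch training is not shown.

```python
def forward(obs, pi, A, B):
    """P(O | lambda), summing over all state paths (forward recursion)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(obs, pi, A, B):
    """Most likely state path and its joint probability with obs."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return path, delta[last]

# Illustrative two-state model (assumed numbers): A[i][j] = P(next = j | current = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

print(forward(obs, pi, A, B))
print(viterbi(obs, pi, A, B))
```

Both recursions cost O(N²·T). For long sequences an implementation would work with log-probabilities to avoid numerical underflow.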


Relation Extraction

Example

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland


Relation Extraction

Introduction

• No single source contains all the relations
• Each relation appears on many web pages
• There are repeated patterns in the way relations are represented on web pages
  → exploit redundancy
• Components of a relation appear "close" together
  → use context of occurrence of relation to determine patterns
• pattern: consists of constants (tokens) and variables (placeholders for entities)
• tuple: instance / occurrence of a relation


Relation Extraction

Introduction

• Typically requires entity extraction (tagging) as preprocessing
• Knowledge engineering approach
  - patterns defined over lexical items
    "<company> located in <location>"
  - patterns defined over parsed text
    "((Obj <company>) (Verb located) (*) (Subj <location>))"
• Machine learning approach
  - learn rules/patterns from examples
  - partially-supervised: bootstrap from example tuples [Agichtein & Gravano 2000, Etzioni et al 2004]


Relation Extraction

Snowball [Agichtein & Gravano 2000]

• Exploit duality between patterns and tuples
  - find tuples that match a set of patterns
  - find patterns that match a lot of tuples
  → bootstrapping approach

[Figure: Snowball loop: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table → (repeat)]


Relation Extraction

Snowball

• how to represent patterns of occurrences?

initial seed tuples

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

occurrences of seed tuples

Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, share of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.


Relation Extraction

Patterns

• (extraction) pattern has format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms
• patterns derived directly from occurrences are too specific

Example occurrence: ORGANIZATION 's central headquarters in LOCATION is home to...
  tag1   = ORGANIZATION
  middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
  tag2   = LOCATION
  right  = {<is 0.75>, <home 0.75>}
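One way to hold such a 5-tuple in code is a tuple of sparse term-weight dictionaries. The sketch below also compares two patterns by summing the inner products of their left, middle, and right vectors when the entity tags agree. That similarity function reflects the original Snowball paper as recalled here, not these slides, so treat it as an assumption; the names `dot` and `match` are hypothetical.

```python
def dot(u, v):
    """Inner product of two sparse term-weight vectors (dicts)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def match(p, q):
    """Similarity of two patterns <left, tag1, middle, tag2, right>.

    Assumption: patterns with different tag pairs do not match at all;
    otherwise the three context vectors are compared by inner product.
    """
    left_p, tag1_p, mid_p, tag2_p, right_p = p
    left_q, tag1_q, mid_q, tag2_q, right_q = q
    if (tag1_p, tag2_p) != (tag1_q, tag2_q):
        return 0.0
    return dot(left_p, left_q) + dot(mid_p, mid_q) + dot(right_p, right_q)

p1 = ({"servers": 0.75, "at": 0.75}, "ORGANIZATION",
      {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5}, "LOCATION", {})
p2 = ({"operate": 0.75, "from": 0.75}, "ORGANIZATION",
      {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {})

print(match(p1, p2))
```

A similarity of this kind is what lets overly specific patterns be grouped into clusters whose centroids serve as more general patterns.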


Relation Extraction

Pattern Clusters

• cluster patterns, cluster centroids define patterns

Cluster 1
  left = {<servers 0.75> <at 0.75>}, ORGANIZATION, middle = {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>}, LOCATION
  left = {<operate 0.75> <from 0.75>}, ORGANIZATION, middle = {<’s 0.7> <headquarters 0.7> <in 0.7>}, LOCATION

Cluster 2
  left = {<shares 0.75> <of 0.75>}, LOCATION, middle = {<- 0.75> <based 0.75>}, ORGANIZATION, right = {<fell 1>}
  left = {<the 1>}, LOCATION, middle = {<- 0.75> <based 0.75>}, ORGANIZATION, right = {<introduced 0.75> <a 0.75>}


Relation Extraction

Evaluation of Patterns

• How good are new extraction patterns?
• Measure their performance through their accuracy vs. the initial seed tuples (ground truth).

extraction with pattern “ORGANIZATION, LOCATION”

Boeing, Seattle, said…                                          Positive
Intel, Santa Clara, cut prices…                                 Positive
invest in Microsoft, New York-based analyst Jane Smith said     Negative

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA


Relation Extraction

Evaluation of Patterns

• Trust only patterns with high “support” and “confidence”, i.e. that produce many correct (positive) tuples and only a few false (negative) tuples.
• conf(p) = pos(p) / (pos(p) + neg(p))
  where p denotes a pattern and pos(p), neg(p) denote the numbers of positive and negative tuples produced
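The confidence of a pattern can be computed directly against the seed table. In the sketch below the names `seed`, `extracted`, and `pattern_confidence` are hypothetical, and skipping tuples whose organization is absent from the seed table is an assumption: such tuples can be judged neither correct nor false.

```python
# Seed table (ground truth) and the tuples produced by the pattern
# "ORGANIZATION, LOCATION" in the example above.
seed = {"MICROSOFT": "REDMOND", "IBM": "ARMONK",
        "BOEING": "SEATTLE", "INTEL": "SANTA CLARA"}
extracted = [("BOEING", "SEATTLE"),
             ("INTEL", "SANTA CLARA"),
             ("MICROSOFT", "NEW YORK")]

def pattern_confidence(extracted, seed):
    """conf(p) = pos(p) / (pos(p) + neg(p)) measured against the seed tuples."""
    pos = neg = 0
    for org, loc in extracted:
        if org not in seed:
            continue          # assumption: unknown organizations are not counted
        if seed[org] == loc:
            pos += 1          # agrees with the seed table
        else:
            neg += 1          # contradicts the seed table
    return pos / (pos + neg) if pos + neg else 0.0

print(pattern_confidence(extracted, seed))
```

For the three extractions of the example this gives two positives and one negative, i.e. a confidence of 2/3.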


Relation Extraction

Evaluation of Tuples

• Trust only tuples that match many patterns.
• Suppose candidate tuple $t$ matches patterns $p_1$ and $p_2$. What is the probability that $t$ is a valid tuple?
• Assume matches of different patterns are independent events.
• $\Pr[t \text{ is not valid} \mid t \text{ matches } p_1] = 1 - \mathrm{conf}(p_1)$
  $\Pr[t \text{ is not valid} \mid t \text{ matches } p_2] = 1 - \mathrm{conf}(p_2)$
  $\Pr[t \text{ is not valid} \mid t \text{ matches } \{p_1, p_2\}] = (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
  $\Pr[t \text{ is valid} \mid t \text{ matches } \{p_1, p_2\}] = 1 - (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
• If tuple $t$ matches a set of patterns $P$:
  $\mathrm{conf}(t) = 1 - \prod_{p \in P} (1 - \mathrm{conf}(p))$
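Under the independence assumption, the tuple confidence is one minus the product of the individual pattern failure probabilities. A minimal sketch follows; the function name `tuple_confidence` is hypothetical.

```python
import math

def tuple_confidence(pattern_confs):
    """conf(t) = 1 - product over matching patterns p of (1 - conf(p))."""
    return 1.0 - math.prod(1.0 - c for c in pattern_confs)

# A tuple matched by two patterns, each with confidence 0.5.
print(tuple_confidence([0.5, 0.5]))
```

Every additional matching pattern can only raise the confidence, which is the formal counterpart of "trust only tuples that match many patterns". A tuple matched by no pattern gets confidence 0.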


Relation Extraction

Snowball Algorithm

1. Start with seed set R of tuples
2. Generate set P of patterns from R
   compute support and confidence for each pattern in P
   discard patterns with low support or confidence
3. Generate new set T of tuples matching patterns P
   compute confidence of each tuple in T
   add to R the tuples t in T with conf(t) > threshold
4. Go back to step 2


Relation Extraction

Discussion

• bootstrapping approach requires only a relatively small number of training tuples (semi-supervised)
• is effective for binary, 1:1 relations
• bootstrapping approach has been adopted by much subsequent work
• pattern evaluation is heuristic and has no theory behind it
  → Statistical Snowball, WWW 09
• what about n-ary relations?
• what about 1:m relations?




Entity Resolution

Introduction

• Entity resolution
  - map entity mentions to the corresponding entities
  - entities stored in database or ontology
• Challenges
  - large lists with multiple noisy mentions of the same entity
  - no single attribute to order or cluster likely duplicates while separating them from similar but different entities
  - need to depend on fuzzy and computationally expensive string similarity functions


Entity Resolution

Introduction

• Typical approach
  - define string similarity
    numeric attributes are easy to compare; string attributes are hard
    → need to perform approximate matches
  - find similar pairs of entities
  - create groups from duplicate entity pairs (clustering)


Entity Resolution

String Similarity

•Token-based

Jaccard

TF-IDF cosine similarities

suitable for large documents

• Character-based

Edit-distance and variants like Levenshtein, Jaro-Winkler

Soundex

suitable for short strings with spelling mistakes

• Hybrids


Entity Resolution

Token-Based String Similarity

• Tokens/words
  ‘AT&T Corporation’ → ‘AT&T’, ‘Corporation’
• Similarity: various measures of overlap of two sets S, T
• Jaccard(S,T) = |S∩T| / |S∪T|
• Example
  S = ‘AT&T Corporation’ → ‘AT&T’, ‘Corporation’
  T = ‘AT&T Corp’ → ‘AT&T’, ‘Corp’
  Jaccard(S,T) = 1/3
• Variants: weights attached to each token
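The Jaccard example can be checked with a few lines of code. The sketch assumes whitespace tokenization; the function name `jaccard` and the convention that two empty token sets have similarity 1 are assumptions.

```python
def jaccard(s, t):
    """Jaccard similarity of the whitespace-token sets of two strings."""
    S, T = set(s.split()), set(t.split())
    if not S and not T:
        return 1.0            # assumption: two empty strings count as identical
    return len(S & T) / len(S | T)

print(jaccard("AT&T Corporation", "AT&T Corp"))
```

The two strings share one token (‘AT&T’) out of three distinct tokens overall, which gives the value 1/3 from the example.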



• Sets transformed to vectors with each term as a dimension
• Cosine similarity:
  dot-product of two vectors, each normalized to unit length
  → cosine of the angle between them
• Term weight = TF-IDF
  log(tf + 1) * log(idf), where
  tf: frequency of ‘term’ in a document d
  idf: number of documents / number of documents containing ‘term’
  → rare ‘terms’ are more important



• Widely used in traditional IR

• Example:

‘AT&T Corporation’, ‘AT&T Corp’ or ‘AT&T Inc’

low weights for ‘Corporation’, ‘Corp’, ‘Inc’,

higher weight for ‘AT&T’
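The effect of the weighting can be seen on a toy collection. The sketch below uses the weight log(tf + 1) * log(idf) with idf = number of documents / number of documents containing the term. The six-name corpus and all function names are assumptions for illustration, and `weights` only handles terms that occur in the corpus.

```python
import math
from collections import Counter

def build_idf(corpus):
    """idf(term) = number of documents / number of documents containing term."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc.split()))
    return {term: n / count for term, count in df.items()}

def weights(doc, idf):
    """Sparse TF-IDF vector: log(tf + 1) * log(idf) per term of the document."""
    tf = Counter(doc.split())
    return {term: math.log(f + 1) * math.log(idf[term]) for term, f in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0 if either is all zero)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus (assumed): company names in which the legal suffixes are frequent.
corpus = ["AT&T Corporation", "AT&T Corp", "IBM Corporation",
          "IBM Corp", "Intel Corporation", "Intel Corp"]
idf = build_idf(corpus)

same = cosine(weights("AT&T Corporation", idf), weights("AT&T Corp", idf))
other = cosine(weights("AT&T Corporation", idf), weights("IBM Corporation", idf))
print(same, other)
```

Because ‘Corporation’ occurs in half of the names and ‘AT&T’ in only a third, the shared token ‘AT&T’ counts for more than a shared ‘Corporation’, so the first pair scores higher than the second.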


Entity Resolution

Character-Based String Similarity

• Given two strings S, T, edit(S,T): cost of the minimum-cost sequence of operations that transforms S to T.
• Character operations: I (insert), D (delete), R (replace).
• Example: edit(Error,Eror) = 1, edit(great,grate) = 2
• Dynamic programming algorithm to compute edit()
• Several variants (gaps, weights)
  some variants become NP-complete
• Varying costs of operations: can be learnt
• Suitable for common typing mistakes on small strings
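The dynamic programming algorithm for the unit-cost case (every insert, delete, and replace costs 1) fits in a few lines. This is the standard Levenshtein recurrence, kept to two rows of the table; the function name `edit_distance` is an assumption.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between s and t (insert, delete, replace)."""
    prev = list(range(len(t) + 1))        # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        cur = [i]                         # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # replace, or match at cost 0
        prev = cur
    return prev[-1]

print(edit_distance("Error", "Eror"), edit_distance("great", "grate"))
```

The running time is O(|S|·|T|), which is why the measure is practical for short strings but expensive when applied to all pairs of a large list.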


Entity Resolution

Find Duplicate Pairs

• Input: a large list of entities with string attributes
• Output: all pairs (S,T) of entities which satisfy a similarity criterion such as
  Jaccard(S,T) > 0.7
  Edit-distance(S,T) < k
• Naive method: for each record pair, compute similarity score
• I/O and CPU intensive, not scalable to millions of entities
• Goal: reduce $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
• Reduce number of pairs on which similarity is computed


Entity Resolution


• Method: filter and refinement
• Use inexpensive filter to filter out as many pairs as possible
  e.g. EditDistance(s,t) ≤ d →
       |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) - (d-1)*q - 1
  q-gram: substring of q consecutive characters
  e.g. 3-grams for ‘AT&T Corporation’
  {‘AT&’, ‘T&T’, ‘&T ’, ‘T C’, ‘ Co’, ‘Cor’, ‘orp’, ‘rpo’, ‘por’, ‘ora’, ‘rat’, ‘ati’, ‘tio’, ‘ion’}
• If a pair (s, t) does not satisfy the filter, it cannot satisfy the similarity criterion
  e.g., |q-grams(s) ∩ q-grams(t)| < max(|s|,|t|) - (d-1)*q - 1 →
        EditDistance(s,t) > d
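The count filter can be sketched as below. One assumption matters here: the bound is applied to q-grams of strings padded with q-1 boundary characters on each side and counted as multisets, the setting in which this bound is usually stated. The padding characters `#` and `$` (assumed not to occur in the data) and the function names are hypothetical.

```python
from collections import Counter

def qgrams(s, q):
    """Multiset of q-grams of s, padded with q-1 boundary characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def passes_filter(s, t, d, q):
    """Necessary condition for EditDistance(s, t) <= d (count filter)."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())   # multiset intersection
    return shared >= max(len(s), len(t)) - (d - 1) * q - 1

print(passes_filter("Error", "Eror", d=1, q=3))
print(passes_filter("AT&T Corporation", "IBM", d=1, q=3))
```

A pair that fails the test can be discarded without computing the edit distance. A pair that passes still has to be verified, since the filter is a necessary condition only.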


Entity Resolution


• Do not have to apply the filter to all pairs of entities
  use an index to retrieve the subset of entities that share q-grams
• Compute the expensive similarity function only on pairs that survive the filter step
  e.g. EditDistance(s,t)
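The whole filter-and-refinement pipeline can be sketched with an inverted index from q-grams to entity positions. As assumptions, the sketch uses padded q-grams, the count filter max(|s|,|t|) - (d-1)*q - 1, and unit-cost edit distance as the expensive function; all names are hypothetical. For very short strings the bound can drop to 0 or below, and pairs sharing no q-gram would then need separate handling, which is not shown.

```python
from collections import Counter, defaultdict
from itertools import combinations

def qgrams(s, q):
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def similar_pairs(names, d, q=3):
    """All pairs within edit distance d among pairs that share a q-gram."""
    # 1. Index: q-gram -> positions of the names containing it.
    index = defaultdict(set)
    for i, name in enumerate(names):
        for gram in qgrams(name, q):
            index[gram].add(i)
    # 2. Candidates: only pairs that co-occur in some posting list.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    # 3. Cheap count filter, then the expensive verification.
    result = []
    for i, j in sorted(candidates):
        s, t = names[i], names[j]
        shared = sum((qgrams(s, q) & qgrams(t, q)).values())
        if shared < max(len(s), len(t)) - (d - 1) * q - 1:
            continue
        if edit_distance(s, t) <= d:
            result.append((s, t))
    return result

names = ["AT&T Corporation", "AT&T Corporatoin", "IBM Corporation", "Intel"]
print(similar_pairs(names, d=2))
```

In this example ‘Intel’ shares no q-gram with any other name, so it is never paired at all, and the ‘IBM Corporation’ pairs are rejected before any edit distance is computed.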


Entity Resolution

Create Groups of Duplicates

• Given pairs of duplicate entities
• Group them such that each group corresponds to one entity
• Many clustering algorithms have been applied
• Number of clusters hard to specify in advance
• Ground truth may be available for some entity pairs
  → semi-supervised clustering


Entity Resolution

Create Groups of Duplicates

• Agglomerative clustering:
  repeatedly merge closest clusters
• Definition of closeness of clusters subject to tuning
  Average/Max/Min similarity
• Efficient implementations possible using special data structures
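A naive version of agglomerative clustering is sketched below. Since the number of clusters is hard to fix in advance, merging stops when the closest pair of clusters falls below a similarity threshold, which is an assumed stopping rule. The token-based Jaccard similarity, the threshold value, and all names are likewise assumptions, and the cubic-time loop is for illustration only.

```python
def jaccard(s, t):
    S, T = set(s.split()), set(t.split())
    return len(S & T) / len(S | T) if S | T else 1.0

def agglomerative(items, sim, threshold, linkage=max):
    """Repeatedly merge the two closest clusters until none is close enough.

    linkage aggregates the pairwise similarities between two clusters:
    max, min, or an average function.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage([sim(x, y) for x in clusters[a] for y in clusters[b]])
                if best is None or score > best:
                    best, pair = score, (a, b)
        if best < threshold:
            break                         # closest clusters are too dissimilar
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

mentions = ["AT&T Corporation", "AT&T Corp", "AT&T", "IBM Corp", "IBM"]
print(agglomerative(mentions, jaccard, threshold=0.4))
```

With max-similarity linkage the three AT&T mentions end up in one group and the two IBM mentions in another. Switching `linkage` to `min` or to an average changes which merges are allowed, which is the tuning knob mentioned above.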


Entity Resolution

Challenges

• Collective entity resolution
  consider relationships between entities and propagate resolution decisions along these relationships
  use Markov Logic Networks [Parag & Domingos 2005]
• Mapping to existing background knowledge
  ontology of real world entities may be given
  map entities / clusters of entities to ontology entries
  k-nearest neighbor methods


Information Extraction

References

• Eugene Agichtein, Luis Gravano: Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL, 2000
• Eugene Agichtein, Sunita Sarawagi: Scalable Information Extraction and Integration, Tutorial KDD 2006
• Eugene Charniak: Statistical Techniques for Natural Language Parsing, AI Magazine 18(4), 1997
• S. Chaudhuri, R. Ramakrishnan, G. Weikum: Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?, CIDR 2005
• Ronen Feldman: Information Extraction: Theory and Practice, Tutorial ICML 2006
• John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
• L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. IEEE 77(2), 1989
