
1

Information Extraction using HMMs

Sunita Sarawagi

2

IE by text segmentation

Source: concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.

Address example, segmented into House number | Building | Road | City | State | Zip:
4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122

Citation example, segmented into Author | Year | Title | Journal | Volume | Page:
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick | (1993) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media | J. Amer. Chem. Soc. | 115 | 12231-12237

3

Hidden Markov Models

Doubly stochastic models. Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- The highest probability path P that maximizes Pr(S,P) (Viterbi)
- Training the model (Baum-Welch algorithm)

[Figure: a four-state HMM (S1-S4) with transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 on the arcs and a per-state emission distribution over the symbols A and C (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]

4

Input features

- Content of the element
  - Specific keywords like street, zip, vol, pp
  - Properties of words like capitalization, part of speech, number?
- Inter-element sequencing
- Intra-element sequencing
- Element length
- External database
  - Dictionary words
  - Semantic relationship between words
- Frequency constraints

5

IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM with states Author, Year, Title, Journal; the arcs carry transition probabilities (0.9, 0.5, 0.5, 0.8, 0.2, 0.1) and each state has emission probabilities over its vocabulary, e.g. A 0.6 / B 0.3 / C 0.1, X 0.4 / B 0.2 / Z 0.4, Y 0.1 / A 0.1 / C 0.8, and dddd 0.8 / dd 0.2 for Year.]

6

HMM Structure

- Naïve model: one state per element
- Nested model: each element is itself an HMM

7

Comparing nested models

- Naïve: single state per tag
  - Element length distribution: a, a², a³, …
  - Intra-tag sequencing not captured
- Chain:
  - Element length distribution: each length gets its own parameter
  - Intra-tag sequencing captured
  - Arbitrary mixing of dictionary words, e.g. "California York": Pr(W|L) not modeled well
- Parallel path:
  - Element length distribution: each length gets a parameter
  - Separates the vocabulary of different-length elements (limited bigram model)

8

Embedding an HMM in a state

9

Bigram model of Bikel et al.

- Each inner model is a detailed bigram model
  - First word: conditioned on the state and the previous state
  - Subsequent words: conditioned on the previous word and the state
  - Special "start" and "end" symbols that can be thought of as additional words of the element
- Large number of parameters (training data of the order of ~60,000 words in the smallest experiment)
- Backing-off mechanism to previous, simpler "parent" models (lambda parameters to control mixing)
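As a rough illustration of the backing-off idea (a plain linear-interpolation sketch, not the exact parameterization of Bikel et al.; the lambda weights and table layout are hypothetical):

```python
def interpolated_word_prob(word, prev_word, state,
                           bigram, unigram, vocab_size,
                           lambdas=(0.6, 0.3, 0.1)):
    """Mix a state-specific bigram estimate with simpler 'parent' models.

    bigram[state][(prev_word, word)] and unigram[state][word] hold
    relative-frequency estimates (hypothetical layout); the last term
    backs off all the way to a uniform distribution over the vocabulary.
    """
    l1, l2, l3 = lambdas
    p_bigram = bigram.get(state, {}).get((prev_word, word), 0.0)
    p_unigram = unigram.get(state, {}).get(word, 0.0)
    return l1 * p_bigram + l2 * p_unigram + l3 * (1.0 / vocab_size)
```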

10

Separate HMM per tag

Special prefix and suffix states capture the start and end of a tag.

[Figure: two inner HMMs, one for "Road name" (Prefix, S1, S2, S4, Suffix) and one for "Building name" (Prefix, S1, S2, S3, S4, Suffix).]

11

HMM Dictionary

- For each word (= feature), associate the probability of emitting that word
  - Multinomial model
- Features of a word, for example: part of speech, capitalized or not, type (number, letter, word, etc.)
  - Maximum entropy models (McCallum 2000), other exponential models
- Bikel: <word, feature> pairs

12

Feature Hierarchy

[Figure: feature hierarchy rooted at "All", with children Numbers (3-digits: 000..999; 5-digits: 00000..99999; Others: 0..99, 0000..9999, 000000..), Words (Chars: A..z; Multi-letter: aa..), and Delimiters (. , / - + ? #).]
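A minimal sketch of mapping a token to its most specific class in a hierarchy like this one; the class names, ordering and regular expressions below are illustrative, not taken from the paper:

```python
import re

# Ordered from most specific to most general; illustrative classes only.
FEATURE_PATTERNS = [
    ("3-digits",     re.compile(r"^\d{3}$")),
    ("5-digits",     re.compile(r"^\d{5}$")),
    ("other-number", re.compile(r"^\d+$")),
    ("single-char",  re.compile(r"^[A-Za-z]$")),
    ("multi-letter", re.compile(r"^[A-Za-z]+$")),
    ("delimiter",    re.compile(r"^[.,/\-+?#]$")),
]

def token_feature(token: str) -> str:
    """Return the most specific feature class that matches the token."""
    for name, pattern in FEATURE_PATTERNS:
        if pattern.match(token):
            return name
    return "all"  # fall back to the root of the hierarchy
```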

13

Learning model parameters

- When the training data defines a unique path through the HMM (a count-based sketch follows after this list):
  - Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i)
  - Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions out of i)
- When the training data defines multiple paths: a more general EM-like algorithm (Baum-Welch)
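A minimal sketch of these count-based estimates, assuming (purely for illustration) that each training record is a list of (state, token) pairs along the unique path:

```python
from collections import defaultdict

def train_hmm(labeled_sequences):
    """Maximum-likelihood transition and emission estimates from counts.

    labeled_sequences: list of sequences, each a list of (state, token)
    pairs following the unique path through the HMM (assumed layout).
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for seq in labeled_sequences:
        for state, token in seq:
            emit_counts[state][token] += 1
        for (s_i, _), (s_j, _) in zip(seq, seq[1:]):
            trans_counts[s_i][s_j] += 1
    transitions = {i: {j: c / sum(row.values()) for j, c in row.items()}
                   for i, row in trans_counts.items()}
    emissions = {i: {k: c / sum(row.values()) for k, c in row.items()}
                 for i, row in emit_counts.items()}
    return transitions, emissions
```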

14

Smoothing

Two kinds of missing symbols:
- Case 1: unknown over the entire dictionary
- Case 2: zero count in some state

Approaches (a sketch of the Laplace estimate follows after this list):
- Laplace smoothing: P(symbol i | state) = (k_i + 1) / (m + |T|), where k_i is the count of symbol i in the state, m the total count of symbols in the state, and |T| the dictionary size
- Absolute discounting:
  - P(unknown) proportional to the number of distinct tokens: P(unknown) = k' × (number of distinct symbols)
  - P(known) = (actual probability) - k', where k' is a small fixed constant, smaller for case 2 than for case 1
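A minimal sketch of the Laplace estimate above for one state's emission distribution (the example counts and vocabulary are made up):

```python
def laplace_smoothed_emissions(counts, vocabulary):
    """(k_i + 1) / (m + |T|) for every symbol i in the dictionary T.

    counts: symbol -> raw count in one state (unseen symbols get count 0).
    vocabulary: all symbols T known to the model.
    """
    vocab = list(vocabulary)
    m = sum(counts.values())
    return {sym: (counts.get(sym, 0) + 1) / (m + len(vocab)) for sym in vocab}

# Example: 'street' seen 8 times, 'road' twice, 'avenue' never.
probs = laplace_smoothed_emissions({"street": 8, "road": 2},
                                   ["street", "road", "avenue"])
# probs["avenue"] == 1 / 13, probs["street"] == 9 / 13
```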

15

Smoothing (Cont.)

Smoothing parameters derived from data:
- Partition the training data into two parts
- Train on part 1
- Use part 2 to map all new tokens to UNK and treat it as a new word in the vocabulary
- OK for case 1, not good for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size)
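A minimal sketch of the held-out estimate for case 1 (hypothetical token lists; the resulting mass is assigned to the special UNK word):

```python
def unk_probability(part1_tokens, part2_tokens):
    """Fraction of held-out (part-2) tokens never seen while training on part 1."""
    seen = set(part1_tokens)
    unseen = sum(1 for tok in part2_tokens if tok not in seen)
    return unseen / len(part2_tokens)
```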

16

Using the HMM to segment

Find the highest probability path through the HMM. Viterbi: quadratic dynamic programming algorithm.

[Figure: the Viterbi trellis for the address "115 Grant street Mumbai 400070": one column per observation o_t and one row per state (House, Road, City, Pin); the segmentation is read off the highest-probability path through the trellis.]

17

Most Likely Path for a Given Sequence

The probability that the path π_0 ... π_N is taken and the sequence x_1 ... x_L is generated:

Pr(x_1 ... x_L, π_0 ... π_N) = a_{0π_1} ∏_{i=1..L} b_{π_i}(x_i) a_{π_i π_{i+1}}

where the a terms are transition probabilities and the b terms are emission probabilities.

18

Example

[Figure: a five-state HMM over the DNA alphabet: a begin state (0), four emitting states (1-4), each with its own emission distribution over A, C, G, T (e.g. A 0.4 / C 0.1 / G 0.1 / T 0.4 and A 0.1 / C 0.4 / G 0.4 / T 0.1), an end state (5), and transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8 on the arcs.]

Pr(AAC, π) = a_{01} b_1(A) a_{11} b_1(A) a_{13} b_3(C) a_{35}
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
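A minimal sketch of evaluating this product in code. The dictionaries contain only the transition and emission entries that the worked example uses (everything else in the figure is omitted); state 0 is the begin state and state 5 the end state:

```python
def path_sequence_prob(path, seq, trans, emit):
    """Pr(x_1..x_L, path) = a_{0,pi_1} * prod_i [b_{pi_i}(x_i) * a_{pi_i,pi_{i+1}}].

    path includes the begin and end states; seq is the emitted symbol string.
    """
    prob = 1.0
    for i, symbol in enumerate(seq, start=1):
        prob *= trans[(path[i - 1], path[i])] * emit[(path[i], symbol)]
    prob *= trans[(path[len(seq)], path[len(seq) + 1])]  # step into the end state
    return prob

# Entries used by the slide's example only (assumed reading of the figure).
trans = {(0, 1): 0.5, (1, 1): 0.2, (1, 3): 0.8, (3, 5): 0.6}
emit = {(1, "A"): 0.4, (3, "C"): 0.3}
print(path_sequence_prob([0, 1, 1, 3, 5], "AAC", trans, emit))  # 0.002304
```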

19

Finding the most probable path: the Viterbi algorithm

Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.

We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.

v can be defined recursively, and dynamic programming can be used to compute v_N(L) efficiently.

20

Finding the most probable path: the Viterbi algorithm

Initialization:

v_0(0) = 1
v_k(0) = 0 for the other states k

21

The Viterbi algorithm

Recursion for the emitting states (i = 1 … L):

v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]
ptr_l(i) = argmax_k [ v_k(i-1) a_{kl} ]   (keep track of the most probable path)

22

The Viterbi algorithm

Termination:

Pr(x, π) = max_k [ v_k(L) a_{kN} ]
π_L = argmax_k [ v_k(L) a_{kN} ]

To recover the most probable path, follow the pointers back starting at π_L.
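A minimal sketch of the recursion and traceback above, reusing the (trans, emit) dictionary layout from the earlier example; state 0 is the begin state and end_state is the silent end state:

```python
def viterbi(seq, states, trans, emit, end_state):
    """Most probable state path for seq under an HMM with begin state 0."""
    # v[i][k]: probability of the best path emitting seq[:i] and ending in state k
    v = [{k: 0.0 for k in states} for _ in range(len(seq) + 1)]
    ptr = [{} for _ in range(len(seq) + 1)]
    for k in states:                       # initialization uses v_0(0) = 1
        v[1][k] = trans.get((0, k), 0.0) * emit.get((k, seq[0]), 0.0)
    for i in range(2, len(seq) + 1):       # recursion
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * trans.get((k, l), 0.0))
            v[i][l] = (emit.get((l, seq[i - 1]), 0.0)
                       * v[i - 1][best_k] * trans.get((best_k, l), 0.0))
            ptr[i][l] = best_k
    # Termination: best transition into the end state
    last = max(states, key=lambda k: v[len(seq)][k] * trans.get((k, end_state), 0.0))
    best_prob = v[len(seq)][last] * trans.get((last, end_state), 0.0)
    path = [last]                          # follow the pointers back
    for i in range(len(seq), 1, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), best_prob

# With the example dictionaries from slide 18:
# viterbi("AAC", [1, 2, 3, 4], trans, emit, end_state=5) -> ([1, 1, 3], 0.002304)
```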

23

Database Integration

- Augment the dictionary
  - Example: list of cities
  - Assigning probabilities is a problem
- Exploit functional dependencies
  - Examples: Santa Barbara -> USA, Piskinov -> Georgia

24

2001 University Avenue, Kendall Sq., Piskinov, Georgia

Two possible segmentations:
- 2001 (House number) | University Avenue (Road Name) | Kendall Sq. (Area) | Piskinov (City) | Georgia (State)
- 2001 (House number) | University Avenue (Road Name) | Kendall Sq. (Area) | Piskinov (City) | Georgia (Country)

The functional dependency Piskinov -> Georgia from the previous slide resolves the ambiguity: Georgia should be tagged as the country, not a US state.

25

Frequency constraints

- Including constraints of the form: the same tag cannot appear in two disconnected segments
  - E.g. the Title in a citation cannot appear twice; a street name cannot appear twice
- Not relevant for named-entity tagging kinds of problems

26

Constrained Viterbi

Original Viterbi:

v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]

Modified Viterbi:

….
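The modified recursion itself is omitted on the slide. Purely as one possible way to enforce "a tag cannot reappear once its segment has closed" (not necessarily the formulation used in the original work), the dynamic-programming state can be augmented with the set of tags whose segments have already ended:

```python
def constrained_viterbi_prob(seq, states, trans, emit, end_state):
    """Best-path probability when no tag may reappear after its segment ends.

    DP state = (current tag, frozenset of tags whose segments are closed).
    Sketch only: exponential in the number of tags, fine for a handful.
    """
    # Initialization: first symbol emitted from some tag, nothing closed yet.
    best = {(k, frozenset()): trans.get((0, k), 0.0) * emit.get((k, seq[0]), 0.0)
            for k in states}
    for symbol in seq[1:]:
        new_best = {}
        for (k, closed), p in best.items():
            if p == 0.0:
                continue
            for l in states:
                closed_next = closed if l == k else closed | {k}
                if l != k and l in closed_next:
                    continue  # would reopen a tag whose segment already ended
                q = p * trans.get((k, l), 0.0) * emit.get((l, symbol), 0.0)
                key = (l, frozenset(closed_next))
                if q > new_best.get(key, 0.0):
                    new_best[key] = q
        best = new_best
    return max((p * trans.get((k, end_state), 0.0) for (k, _), p in best.items()),
               default=0.0)
```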

27

Comparative Evaluation

- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule Learning Method: Rapier
- Nested Model: each state in the Naïve model replaced by an HMM

28

Results: Comparative Evaluation

Dataset                   Instances   Elements
IITB student addresses    2388        17
Company addresses         769         6
US addresses              740         6

The Nested model does best in all three cases (from Borkar 2001).

29

Results: Effect of Feature Hierarchy

Feature Selection showed at least a 3% increase in accuracy

30

Results: Effect of training data size

HMMs are fast learners: we reach very close to the maximum accuracy with just 50 to 100 addresses.

31

HMM approach: summary

Requirement                 Captured by
Inter-element sequencing    Outer HMM transitions
Intra-element sequencing    Inner HMM
Element length              Multi-state inner HMM
Characteristic words        Dictionary
Non-overlapping tags        Global optimization
