
1

Information Extraction using HMMs

Sunita Sarawagi

2

IE by text segmentation

Source: concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.

Address example, segmented into House number | Building | Road | City | State | Zip:
4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122

Citation example, segmented into Author | Year | Title | Journal | Volume | Page:
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick | (1993) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media | J. Amer. Chem. Soc. | 115 | 12231-12237

3

Hidden Markov Models

Doubly stochastic models. Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- The highest probability path P that maximizes Pr(S,P) (Viterbi)
- Training the model (Baum-Welch algorithm)

[Figure: a four-state HMM (S1-S4) with transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 on the arcs and a per-state emission distribution over the symbols A and C (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]

4

Input features

- Content of the element
  - Specific keywords like street, zip, vol, pp
  - Properties of words like capitalization, part of speech, number?
- Inter-element sequencing
- Intra-element sequencing
- Element length
- External database
  - Dictionary words
  - Semantic relationship between words
- Frequency constraints

5

IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM with states Author, Year, Title, Journal; the arcs carry transition probabilities (0.9, 0.5, 0.5, 0.8, 0.2, 0.1) and each state has emission probabilities over its vocabulary, e.g. A 0.6 / B 0.3 / C 0.1, X 0.4 / B 0.2 / Z 0.4, Y 0.1 / A 0.1 / C 0.8, and dddd 0.8 / dd 0.2 for Year.]

6

HMM Structure

- Naïve model: one state per element
- Nested model: each element is itself an HMM

7

Comparing nested models

- Naïve: single state per tag
  - Element length distribution: a, a², a³, …
  - Intra-tag sequencing not captured
- Chain:
  - Element length distribution: each length gets its own parameter
  - Intra-tag sequencing captured
  - Arbitrary mixing of dictionary words, e.g. "California York": Pr(W|L) not modeled well
- Parallel path:
  - Element length distribution: each length gets a parameter
  - Separates the vocabulary of different-length elements (limited bigram model)

8

Embedding an HMM in a state

9

Bigram model of Bikel et al.

- Each inner model is a detailed bigram model
  - First word: conditioned on the state and the previous state
  - Subsequent words: conditioned on the previous word and the state
  - Special "start" and "end" symbols that can be thought of as additional words of the element
- Large number of parameters (training data of the order of ~60,000 words in the smallest experiment)
- Backing-off mechanism to previous, simpler "parent" models (lambda parameters to control mixing)
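As a rough illustration of the backing-off idea (a plain linear-interpolation sketch, not the exact parameterization of Bikel et al.; the lambda weights and table layout are hypothetical):

```python
def interpolated_word_prob(word, prev_word, state,
                           bigram, unigram, vocab_size,
                           lambdas=(0.6, 0.3, 0.1)):
    """Mix a state-specific bigram estimate with simpler 'parent' models.

    bigram[state][(prev_word, word)] and unigram[state][word] hold
    relative-frequency estimates (hypothetical layout); the last term
    backs off all the way to a uniform distribution over the vocabulary.
    """
    l1, l2, l3 = lambdas
    p_bigram = bigram.get(state, {}).get((prev_word, word), 0.0)
    p_unigram = unigram.get(state, {}).get(word, 0.0)
    return l1 * p_bigram + l2 * p_unigram + l3 * (1.0 / vocab_size)
```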

10

Separate HMM per tag

Special prefix and suffix states capture the start and end of a tag.

[Figure: two inner HMMs, one for "Road name" (Prefix, S1, S2, S4, Suffix) and one for "Building name" (Prefix, S1, S2, S3, S4, Suffix).]

11

HMM Dictionary

- For each word (= feature), associate the probability of emitting that word
  - Multinomial model
- Features of a word, for example: part of speech, capitalized or not, type (number, letter, word, etc.)
  - Maximum entropy models (McCallum 2000), other exponential models
- Bikel: <word, feature> pairs

12

Feature Hierarchy

[Figure: feature hierarchy rooted at "All", with children Numbers (3-digits: 000..999; 5-digits: 00000..99999; Others: 0..99, 0000..9999, 000000..), Words (Chars: A..z; Multi-letter: aa..), and Delimiters (. , / - + ? #).]
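A minimal sketch of mapping a token to its most specific class in a hierarchy like this one; the class names, ordering and regular expressions below are illustrative, not taken from the paper:

```python
import re

# Ordered from most specific to most general; illustrative classes only.
FEATURE_PATTERNS = [
    ("3-digits",     re.compile(r"^\d{3}$")),
    ("5-digits",     re.compile(r"^\d{5}$")),
    ("other-number", re.compile(r"^\d+$")),
    ("single-char",  re.compile(r"^[A-Za-z]$")),
    ("multi-letter", re.compile(r"^[A-Za-z]+$")),
    ("delimiter",    re.compile(r"^[.,/\-+?#]$")),
]

def token_feature(token: str) -> str:
    """Return the most specific feature class that matches the token."""
    for name, pattern in FEATURE_PATTERNS:
        if pattern.match(token):
            return name
    return "all"  # fall back to the root of the hierarchy
```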

13

Learning model parameters

- When the training data defines a unique path through the HMM (a count-based sketch follows after this list):
  - Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i)
  - Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions out of i)
- When the training data defines multiple paths: a more general EM-like algorithm (Baum-Welch)
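A minimal sketch of these count-based estimates, assuming (purely for illustration) that each training record is a list of (state, token) pairs along the unique path:

```python
from collections import defaultdict

def train_hmm(labeled_sequences):
    """Maximum-likelihood transition and emission estimates from counts.

    labeled_sequences: list of sequences, each a list of (state, token)
    pairs following the unique path through the HMM (assumed layout).
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for seq in labeled_sequences:
        for state, token in seq:
            emit_counts[state][token] += 1
        for (s_i, _), (s_j, _) in zip(seq, seq[1:]):
            trans_counts[s_i][s_j] += 1
    transitions = {i: {j: c / sum(row.values()) for j, c in row.items()}
                   for i, row in trans_counts.items()}
    emissions = {i: {k: c / sum(row.values()) for k, c in row.items()}
                 for i, row in emit_counts.items()}
    return transitions, emissions
```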

14

Smoothing

Two kinds of missing symbols:
- Case 1: unknown over the entire dictionary
- Case 2: zero count in some state

Approaches (a sketch of the Laplace estimate follows after this list):
- Laplace smoothing: P(symbol i | state) = (k_i + 1) / (m + |T|), where k_i is the count of symbol i in the state, m the total count of symbols in the state, and |T| the dictionary size
- Absolute discounting:
  - P(unknown) proportional to the number of distinct tokens: P(unknown) = k' × (number of distinct symbols)
  - P(known) = (actual probability) - k', where k' is a small fixed constant, smaller for case 2 than for case 1
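A minimal sketch of the Laplace estimate above for one state's emission distribution (the example counts and vocabulary are made up):

```python
def laplace_smoothed_emissions(counts, vocabulary):
    """(k_i + 1) / (m + |T|) for every symbol i in the dictionary T.

    counts: symbol -> raw count in one state (unseen symbols get count 0).
    vocabulary: all symbols T known to the model.
    """
    vocab = list(vocabulary)
    m = sum(counts.values())
    return {sym: (counts.get(sym, 0) + 1) / (m + len(vocab)) for sym in vocab}

# Example: 'street' seen 8 times, 'road' twice, 'avenue' never.
probs = laplace_smoothed_emissions({"street": 8, "road": 2},
                                   ["street", "road", "avenue"])
# probs["avenue"] == 1 / 13, probs["street"] == 9 / 13
```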

15

Smoothing (Cont.)

Smoothing parameters derived from data:
- Partition the training data into two parts
- Train on part 1
- Use part 2 to map all new tokens to UNK and treat it as a new word in the vocabulary
- OK for case 1, not good for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size)
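A minimal sketch of the held-out estimate for case 1 (hypothetical token lists; the resulting mass is assigned to the special UNK word):

```python
def unk_probability(part1_tokens, part2_tokens):
    """Fraction of held-out (part-2) tokens never seen while training on part 1."""
    seen = set(part1_tokens)
    unseen = sum(1 for tok in part2_tokens if tok not in seen)
    return unseen / len(part2_tokens)
```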

16

Using the HMM to segment

Find the highest probability path through the HMM. Viterbi: quadratic dynamic programming algorithm.

[Figure: the Viterbi trellis for the address "115 Grant street Mumbai 400070": one column per observation o_t and one row per state (House, Road, City, Pin); the segmentation is read off the highest-probability path through the trellis.]

17

Most Likely Path for a Given Sequence

The probability that the path π_0 ... π_N is taken and the sequence x_1 ... x_L is generated:

Pr(x_1 ... x_L, π_0 ... π_N) = a_{0π_1} ∏_{i=1..L} b_{π_i}(x_i) a_{π_i π_{i+1}}

where the a terms are transition probabilities and the b terms are emission probabilities.

18

Example

[Figure: a five-state HMM over the DNA alphabet: a begin state (0), four emitting states (1-4), each with its own emission distribution over A, C, G, T (e.g. A 0.4 / C 0.1 / G 0.1 / T 0.4 and A 0.1 / C 0.4 / G 0.4 / T 0.1), an end state (5), and transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8 on the arcs.]

Pr(AAC, π) = a_{01} b_1(A) a_{11} b_1(A) a_{13} b_3(C) a_{35}
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
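A minimal sketch of evaluating this product in code. The dictionaries contain only the transition and emission entries that the worked example uses (everything else in the figure is omitted); state 0 is the begin state and state 5 the end state:

```python
def path_sequence_prob(path, seq, trans, emit):
    """Pr(x_1..x_L, path) = a_{0,pi_1} * prod_i [b_{pi_i}(x_i) * a_{pi_i,pi_{i+1}}].

    path includes the begin and end states; seq is the emitted symbol string.
    """
    prob = 1.0
    for i, symbol in enumerate(seq, start=1):
        prob *= trans[(path[i - 1], path[i])] * emit[(path[i], symbol)]
    prob *= trans[(path[len(seq)], path[len(seq) + 1])]  # step into the end state
    return prob

# Entries used by the slide's example only (assumed reading of the figure).
trans = {(0, 1): 0.5, (1, 1): 0.2, (1, 3): 0.8, (3, 5): 0.6}
emit = {(1, "A"): 0.4, (3, "C"): 0.3}
print(path_sequence_prob([0, 1, 1, 3, 5], "AAC", trans, emit))  # 0.002304
```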

19

Finding the most probable path: the Viterbi algorithm

Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.

We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.

v can be defined recursively, and dynamic programming can be used to compute v_N(L) efficiently.

20

Finding the most probable path: the Viterbi algorithm

Initialization:

v_0(0) = 1
v_k(0) = 0 for the other states k

21

The Viterbi algorithm

Recursion for the emitting states (i = 1 … L):

v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]
ptr_l(i) = argmax_k [ v_k(i-1) a_{kl} ]   (keep track of the most probable path)

22

The Viterbi algorithm

Termination:

Pr(x, π) = max_k [ v_k(L) a_{kN} ]
π_L = argmax_k [ v_k(L) a_{kN} ]

To recover the most probable path, follow the pointers back starting at π_L.
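A minimal sketch of the recursion and traceback above, reusing the (trans, emit) dictionary layout from the earlier example; state 0 is the begin state and end_state is the silent end state:

```python
def viterbi(seq, states, trans, emit, end_state):
    """Most probable state path for seq under an HMM with begin state 0."""
    # v[i][k]: probability of the best path emitting seq[:i] and ending in state k
    v = [{k: 0.0 for k in states} for _ in range(len(seq) + 1)]
    ptr = [{} for _ in range(len(seq) + 1)]
    for k in states:                       # initialization uses v_0(0) = 1
        v[1][k] = trans.get((0, k), 0.0) * emit.get((k, seq[0]), 0.0)
    for i in range(2, len(seq) + 1):       # recursion
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * trans.get((k, l), 0.0))
            v[i][l] = (emit.get((l, seq[i - 1]), 0.0)
                       * v[i - 1][best_k] * trans.get((best_k, l), 0.0))
            ptr[i][l] = best_k
    # Termination: best transition into the end state
    last = max(states, key=lambda k: v[len(seq)][k] * trans.get((k, end_state), 0.0))
    best_prob = v[len(seq)][last] * trans.get((last, end_state), 0.0)
    path = [last]                          # follow the pointers back
    for i in range(len(seq), 1, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), best_prob

# With the example dictionaries from slide 18:
# viterbi("AAC", [1, 2, 3, 4], trans, emit, end_state=5) -> ([1, 1, 3], 0.002304)
```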

23

Database Integration

- Augment the dictionary
  - Example: list of cities
  - Assigning probabilities is a problem
- Exploit functional dependencies
  - Examples: Santa Barbara -> USA, Piskinov -> Georgia

24

2001 University Avenue, Kendall Sq., Piskinov, Georgia

Two possible segmentations:
- 2001 (House number) | University Avenue (Road Name) | Kendall Sq. (Area) | Piskinov (City) | Georgia (State)
- 2001 (House number) | University Avenue (Road Name) | Kendall Sq. (Area) | Piskinov (City) | Georgia (Country)

The functional dependency Piskinov -> Georgia from the previous slide resolves the ambiguity: Georgia should be tagged as the country, not a US state.

25

Frequency constraints

- Including constraints of the form: the same tag cannot appear in two disconnected segments
  - E.g. the Title in a citation cannot appear twice; a street name cannot appear twice
- Not relevant for named-entity tagging kinds of problems

26

Constrained Viterbi

Original Viterbi:

v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]

Modified Viterbi:

….
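The modified recursion itself is omitted on the slide. Purely as one possible way to enforce "a tag cannot reappear once its segment has closed" (not necessarily the formulation used in the original work), the dynamic-programming state can be augmented with the set of tags whose segments have already ended:

```python
def constrained_viterbi_prob(seq, states, trans, emit, end_state):
    """Best-path probability when no tag may reappear after its segment ends.

    DP state = (current tag, frozenset of tags whose segments are closed).
    Sketch only: exponential in the number of tags, fine for a handful.
    """
    # Initialization: first symbol emitted from some tag, nothing closed yet.
    best = {(k, frozenset()): trans.get((0, k), 0.0) * emit.get((k, seq[0]), 0.0)
            for k in states}
    for symbol in seq[1:]:
        new_best = {}
        for (k, closed), p in best.items():
            if p == 0.0:
                continue
            for l in states:
                closed_next = closed if l == k else closed | {k}
                if l != k and l in closed_next:
                    continue  # would reopen a tag whose segment already ended
                q = p * trans.get((k, l), 0.0) * emit.get((l, symbol), 0.0)
                key = (l, frozenset(closed_next))
                if q > new_best.get(key, 0.0):
                    new_best[key] = q
        best = new_best
    return max((p * trans.get((k, end_state), 0.0) for (k, _), p in best.items()),
               default=0.0)
```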

27

Comparative Evaluation

- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule Learning Method: Rapier
- Nested Model: each state in the Naïve model replaced by an HMM

28

Results: Comparative Evaluation

Dataset                   Instances   Elements
IITB student addresses    2388        17
Company addresses         769         6
US addresses              740         6

The Nested model does best in all three cases (from Borkar 2001).

29

Results: Effect of Feature Hierarchy

Feature Selection showed at least a 3% increase in accuracy

30

Results: Effect of training data size

HMMs are fast learners: we reach very close to the maximum accuracy with just 50 to 100 addresses.

31

HMM approach: summary

Requirement                 Captured by
Inter-element sequencing    Outer HMM transitions
Intra-element sequencing    Inner HMM
Element length              Multi-state inner HMM
Characteristic words        Dictionary
Non-overlapping tags        Global optimization
