Information Extraction using HMMs
Sunita Sarawagi
IE by text segmentation
Source: concatenation of structured elements with limited reordering and some missing fields
Example: addresses, bibliographic records

House number | Building         | Road        | City      | State | Zip
4089         | Whispering Pines | Nobel Drive | San Diego | CA    | 92122

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
Author | Year | Title | Journal | Volume | Page
Hidden Markov Models
Doubly stochastic models
Efficient dynamic programming algorithms exist for:
- Finding Pr(S)
- Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
- Training the model (Baum-Welch algorithm)
[Figure: a small example HMM with four states S1-S4, transition probabilities on the edges (0.9, 0.5, 0.5, 0.8, 0.2, 0.1), and a per-state emission probability table over the symbols A and C (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1)]
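As a concrete illustration, here is a minimal Python sketch of how such a doubly stochastic model could be represented and how Pr(S, P) is computed for a given symbol sequence S and state path P. The transition topology of the figure is not fully recoverable, so the specific connections below are assumptions; only the individual probability values echo the figure.

# Minimal representation of the toy HMM sketched above: four states S1-S4,
# transition probabilities between them, and per-state emission probabilities
# over the two symbols A and C. The exact topology is assumed, not recovered
# from the figure.
hmm = {
    "states": ["S1", "S2", "S3", "S4"],
    "start":  {"S1": 1.0},                      # assume we always start in S1
    "trans": {
        "S1": {"S2": 0.9, "S3": 0.1},
        "S2": {"S2": 0.5, "S4": 0.5},
        "S3": {"S3": 0.2, "S4": 0.8},
        "S4": {"S4": 1.0},
    },
    "emit": {
        "S1": {"A": 0.6, "C": 0.4},
        "S2": {"A": 0.3, "C": 0.7},
        "S3": {"A": 0.5, "C": 0.5},
        "S4": {"A": 0.9, "C": 0.1},
    },
}

def path_probability(model, symbols, path):
    """Pr(S, P): probability of emitting `symbols` along the state sequence `path`."""
    p = model["start"].get(path[0], 0.0)
    for i, (state, sym) in enumerate(zip(path, symbols)):
        p *= model["emit"][state].get(sym, 0.0)
        if i + 1 < len(path):
            p *= model["trans"][state].get(path[i + 1], 0.0)
    return p

print(path_probability(hmm, "ACA", ["S1", "S2", "S4"]))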
Input features
- Content of the element
  - Specific keywords like street, zip, vol, pp
  - Properties of words: capitalization, part of speech, is it a number?
- Inter-element sequencing
- Intra-element sequencing
- Element length
- External database
  - Dictionary words
  - Semantic relationship between words
- Frequency constraints
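A small sketch of how per-word features of this kind might be computed; the keyword list and the particular feature set are illustrative assumptions, not the features used in any specific system.

import re

# Illustrative per-token features: specific keywords, capitalization, and
# whether the token is a number. The keyword list is an assumption for the
# sketch, not the feature set of the original system.
KEYWORDS = {"street", "st", "road", "rd", "zip", "vol", "pp"}

def token_features(token):
    return {
        "lower": token.lower(),
        "is_keyword": token.lower().strip(".,") in KEYWORDS,
        "is_capitalized": token[:1].isupper(),
        "is_number": bool(re.fullmatch(r"\d+", token)),
        "num_digits": sum(ch.isdigit() for ch in token),
    }

print(token_features("Nobel"))
print(token_features("92122"))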
IE with Hidden Markov Models
Probabilistic models for IE

[Figure: an HMM over citation elements with states such as Author, Year, Title, Journal; transition probabilities (0.9, 0.5, 0.5, 0.8, 0.2, 0.1) label the edges, and each state carries an emission probability table over its own vocabulary (e.g. A 0.6, B 0.3, C 0.1 for one state; X 0.4, B 0.2, Z 0.4 and Y 0.1, A 0.1, C 0.8 for others; dddd 0.8, dd 0.2 for Year)]
HMM Structure
- Naïve model: one state per element
- Nested model: each element is itself another HMM
Comparing nested models
- Naïve: single state per tag
  - Element length distribution: a, a², a³, … (geometric)
  - Intra-tag sequencing not captured
- Chain:
  - Element length distribution: each length gets its own parameter
  - Intra-tag sequencing captured
  - Arbitrary mixing of the dictionary across positions, e.g. "California York": Pr(W|L) not modeled well
- Parallel path:
  - Element length distribution: each length gets a parameter
  - Separates the vocabulary of different-length elements (limited bigram model)
Embedding an HMM in a state
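A sketch of what embedding an inner HMM per element could look like structurally: the outer model has one node per element, and each node expands into a small chain of inner states. The element names, the chain lengths, and the rule that outer transitions connect the last inner state of one element to the first inner state of another are all illustrative assumptions.

# Sketch of a nested structure: an outer model over elements (tags), where each
# element is expanded into a small inner chain of states. State counts per
# element are illustrative.
def build_nested_model(elements, inner_lengths):
    """Return (states, allowed outer transitions) for a nested HMM skeleton.

    elements      -- ordered list of element names, e.g. ["Author", "Title", ...]
    inner_lengths -- dict mapping element name -> number of inner chain states
    """
    states = []
    for elem in elements:
        # inner chain: Elem.1 -> Elem.2 -> ... -> Elem.k
        states.extend(f"{elem}.{i}" for i in range(1, inner_lengths[elem] + 1))
    # outer transitions: the last inner state of one element may go to the
    # first inner state of any other element (probabilities learned from data)
    outer_edges = [
        (f"{src}.{inner_lengths[src]}", f"{dst}.1")
        for src in elements for dst in elements if src != dst
    ]
    return states, outer_edges

states, edges = build_nested_model(
    ["Author", "Title", "Journal"], {"Author": 2, "Title": 3, "Journal": 1}
)
print(states)
print(edges[:3])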
Bigram model of Bikel et al.
- Each inner model is a detailed bigram model
  - First word: conditioned on the state and the previous state
  - Subsequent words: conditioned on the previous word and the state
  - Special "start" and "end" symbols that can be thought of as extra words marking element boundaries
- Large number of parameters (training data of order ~60,000 words in the smallest experiment)
- Backing-off mechanism to simpler "parent" models (lambda parameters control the mixing)
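A simplified sketch of back-off by interpolation: a sparse, detailed bigram estimate is mixed with a simpler parent estimate through a lambda weight. The exact back-off scheme of Bikel et al. is more elaborate; this only illustrates the lambda-controlled mixing idea, and the probability tables are made up.

# Generic back-off by linear interpolation: mix a detailed (sparse) estimate
# with a simpler parent estimate using a lambda weight. A simplified stand-in
# for the back-off described for the bigram model.
def backoff_prob(word, prev_word, state, bigram, unigram, lam=0.7):
    """P(word | prev_word, state) interpolated with P(word | state)."""
    detailed = bigram.get((state, prev_word), {}).get(word, 0.0)
    parent = unigram.get(state, {}).get(word, 0.0)
    return lam * detailed + (1.0 - lam) * parent

bigram = {("Author", "j."): {"dordick": 0.4}}
unigram = {"Author": {"dordick": 0.05, "smith": 0.1}}
print(backoff_prob("dordick", "j.", "Author", bigram, unigram))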
Separate HMM per tag
Special prefix and suffix states capture the start and end of a tag

[Figure: two per-tag HMMs, one for "Road name" and one for "Building name", each with internal states (S1, S2, S3, S4) flanked by dedicated Prefix and Suffix states]
HMM Dictionary
- For each word (= feature), associate the probability of emitting that word: multinomial model
- Features of a word, for example: part of speech, capitalized or not, type (number, letter, word, etc.)
- Maximum entropy models (McCallum 2000), other exponential models
- Bikel: <word, feature> pairs
Feature Hierarchy

[Figure: a feature hierarchy rooted at "All", split into Numbers (3-digits: 000..999, 5-digits: 00000..99999, Others: 0..99, 0000..9999, 000000..), Words (Chars: A..z, Multi-letter: aa..), and Delimiters (. , / - + ? #)]
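A sketch of mapping a token to a node of such a hierarchy; the class names and the exact branching are illustrative assumptions.

import string

# Map a token to a node in the feature hierarchy sketched above: numbers are
# grouped by digit count (e.g. 3-digit, 5-digit), words and delimiters form
# their own branches. Class names are illustrative.
def feature_class(token):
    if token.isdigit():
        return f"{len(token)}-digit" if len(token) in (3, 5) else "other-number"
    if len(token) == 1 and token.isalpha():
        return "char"
    if token.isalpha():
        return "multi-letter-word"
    if all(ch in string.punctuation for ch in token):
        return "delimiter"
    return "other"

for tok in ["400", "92122", "Mumbai", ",", "BPN'"]:
    print(tok, "->", feature_class(tok))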
Learning model parameters
When training data defines a unique path through the HMM:
- Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i)
- Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (total number of symbols generated from i)
When training data defines multiple paths: a more general EM-like algorithm (Baum-Welch)
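For the unique-path case, parameter estimation is just counting and normalizing, as in this sketch (the tiny labelled address is made up).

from collections import Counter, defaultdict

# When each training sequence defines a unique state path, parameters are just
# normalized counts: P(j | i) = #(i -> j) / #(transitions out of i) and
# P(k | i) = #(i emits k) / #(emissions from i).
def estimate_parameters(labelled_sequences):
    """labelled_sequences: list of [(word, state), ...] sequences."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in labelled_sequences:
        for idx, (word, state) in enumerate(seq):
            emit_counts[state][word] += 1
            if idx + 1 < len(seq):
                trans_counts[state][seq[idx + 1][1]] += 1
    trans = {i: {j: c / sum(cnt.values()) for j, c in cnt.items()}
             for i, cnt in trans_counts.items()}
    emit = {i: {k: c / sum(cnt.values()) for k, c in cnt.items()}
            for i, cnt in emit_counts.items()}
    return trans, emit

data = [[("115", "House"), ("Grant", "Road"), ("street", "Road"),
         ("Mumbai", "City"), ("400070", "Pin")]]
print(estimate_parameters(data))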
Smoothing
Two kinds of missing symbols:
- Case 1: unknown over the entire dictionary
- Case 2: zero count in some state
Approaches:
- Laplace smoothing: P(symbol k in state i) = (k_i + 1) / (m + |T|)
- Absolute discounting:
  - P(unknown) proportional to the number of distinct tokens: P(unknown) = k' × (number of distinct symbols)
  - P(known) = (actual probability) - k', where k' is a small fixed constant, smaller for case 2 than for case 1
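A sketch of the Laplace-smoothing formula above, with k_i the count of symbol k in state i, m the total emission count from i, and |T| the dictionary size.

# Laplace smoothing as on the slide: P(k in state i) = (k_i + 1) / (m + |T|).
def laplace_emission(counts, vocab_size):
    total = sum(counts.values())
    return lambda symbol: (counts.get(symbol, 0) + 1) / (total + vocab_size)

p = laplace_emission({"street": 3, "road": 1}, vocab_size=1000)
print(p("street"), p("avenue"))   # seen vs. unseen symbol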
Smoothing (cont.)
Smoothing parameters derived from data:
- Partition the training data into two parts
- Train on part 1
- Use part 2 to map all new tokens to UNK and treat UNK as a new word in the vocabulary
This is OK for case 1, not good for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size).
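A sketch of the held-out idea for case 1: build the vocabulary from one part of the data and measure how often the other part falls outside it; that fraction is the mass given to UNK. The 50/50 split and the toy token stream are assumptions.

# Held-out estimate of the unknown-word mass: train the vocabulary on part 1,
# then count how often part 2 contains tokens outside that vocabulary. Those
# tokens are mapped to a special UNK symbol.
def unknown_mass(tokens):
    half = len(tokens) // 2
    part1, part2 = tokens[:half], tokens[half:]
    vocab = set(part1)
    unk = sum(1 for t in part2 if t not in vocab)
    return unk / max(len(part2), 1)

tokens = "4089 whispering pines nobel drive san diego ca 92122".split()
print(unknown_mass(tokens))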
Using the HMM to segment
Find the highest-probability path through the HMM.
Viterbi: quadratic dynamic programming algorithm

[Figure: the Viterbi lattice for the address "115 Grant street Mumbai 400070": one column of candidate states (House, Road, City, Pin) per observed token o_t, with the decoded path picking one state per token]
Most Likely Path for a Given Sequence
The probability that the path π_0 … π_N is taken and the sequence x_1 … x_L is generated:

  Pr(x_1 … x_L, π_0 … π_N) = a_{0 π_1} ∏_{i=1}^{L} b_{π_i}(x_i) a_{π_i π_{i+1}}

where the a's are transition probabilities and the b's are emission probabilities.
Example

[Figure: a small HMM with a begin state (0) and an end state (5); states 1-4 emit DNA symbols with the emission tables below, and the edges carry transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8]

Emission tables:
  A 0.4  C 0.1  G 0.1  T 0.4
  A 0.1  C 0.4  G 0.4  T 0.1
  A 0.4  C 0.1  G 0.2  T 0.3
  A 0.2  C 0.3  G 0.3  T 0.2

  Pr(AAC, π) = a_{01} b_1(A) a_{11} b_1(A) a_{13} b_3(C) a_{35}
             = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
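A quick numeric check of the product above, using the reconstructed factors.

# Numeric check of the worked example: the product of the reconstructed factors.
factors = [0.5, 0.4, 0.2, 0.4, 0.8, 0.3, 0.6]
prob = 1.0
for f in factors:
    prob *= f
print(prob)   # Pr(AAC, path) for the worked example above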
Finding the most probable path: the Viterbi algorithm
- Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
- We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
- v_k(i) can be defined recursively, and dynamic programming can be used to find v_N(L) efficiently
Finding the most probable path: the Viterbi algorithm
Initialization:
  v_0(0) = 1
  v_k(0) = 0 for all other states k
The Viterbi algorithm
Recursion for emitting states (i = 1 … L):
  v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]
  ptr_l(i) = argmax_k [ v_k(i-1) a_{kl} ]   (keep track of the most probable path)
The Viterbi algorithm
Termination:
  Pr(x, π*) = max_k [ v_k(L) a_{kN} ]
  π*_L = argmax_k [ v_k(L) a_{kN} ]
To recover the most probable path, follow the pointers back starting at π*_L.
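Putting the recursion, pointers, and termination together, a compact Viterbi decoder might look like the following sketch. Folding the begin and end states into separate start/end probability maps is an implementation choice, and the toy address model at the bottom (probabilities included) is made up to mirror the earlier segmentation example.

# Viterbi decoding following the recursion above:
#   v_l(i) = b_l(x_i) * max_k v_k(i-1) * a_{kl}
# with back-pointers to recover the most probable state path.
def viterbi(obs, states, start_p, trans_p, emit_p, end_p):
    v = [{s: start_p.get(s, 0.0) * emit_p[s].get(obs[0], 0.0) for s in states}]
    ptr = [{}]
    for i in range(1, len(obs)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * trans_p[k].get(l, 0.0))
            v[i][l] = emit_p[l].get(obs[i], 0.0) * v[i - 1][best_k] * trans_p[best_k].get(l, 0.0)
            ptr[i][l] = best_k
    last = max(states, key=lambda k: v[-1][k] * end_p.get(k, 1.0))
    best_prob = v[-1][last] * end_p.get(last, 1.0)
    path = [last]
    for i in range(len(obs) - 1, 0, -1):          # follow pointers back
        path.append(ptr[i][path[-1]])
    return best_prob, list(reversed(path))

# Toy address model in the spirit of the earlier segmentation slide.
states = ["House", "Road", "City", "Pin"]
start_p = {"House": 0.9, "Road": 0.1}
trans_p = {"House": {"Road": 1.0},
           "Road": {"Road": 0.5, "City": 0.5},
           "City": {"Pin": 1.0},
           "Pin": {}}
emit_p = {"House": {"115": 0.8},
          "Road": {"Grant": 0.4, "street": 0.4},
          "City": {"Mumbai": 0.9},
          "Pin": {"400070": 0.9}}
end_p = {"Pin": 1.0}
print(viterbi(["115", "Grant", "street", "Mumbai", "400070"],
              states, start_p, trans_p, emit_p, end_p))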
Database Integration
- Augment the dictionary
  - Example: a list of cities
  - Assigning probabilities is a problem
- Exploit functional dependencies
  - Example: Santa Barbara -> USA, Piskinov -> Georgia
2001 University Avenue, Kendall Sq., Piskinov, Georgia

Two possible segmentations of the same address:
  2001         | University Avenue | Kendall Sq. | Piskinov | Georgia
  House number | Road Name         | Area        | City     | State

  2001         | University Avenue | Kendall Sq. | Piskinov | Georgia
  House number | Road Name         | Area        | City     | Country
Frequency constraints
Including constraints of the form: the same tag cannot appear in two disconnected segments
- E.g. the Title of a citation cannot appear twice; a street name cannot appear twice
Not relevant for named-entity-tagging kinds of problems
Constrained Viterbi
Original Viterbi:
  v_l(i) = b_l(x_i) max_k [ v_k(i-1) a_{kl} ]
Modified Viterbi:
  …
Comparative Evaluation
- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule learning method: Rapier
- Nested model: each state in the Naïve model replaced by an HMM

Results: Comparative Evaluation
The Nested model does best in all three cases (from Borkar 2001)

Dataset                 | Instances | Elements
IITB student addresses  | 2388      | 17
Company addresses       | 769       | 6
US addresses            | 740       | 6
Results: Effect of Feature Hierarchy
Feature Selection showed at least a 3% increase in accuracy
Results: Effect of training data size
HMMs are fast learners: we get very close to the maximum accuracy with just 50 to 100 addresses.
HMM approach: summary
- Inter-element sequencing: outer HMM transitions
- Intra-element sequencing: inner HMM
- Element length: multi-state inner HMM
- Characteristic words: dictionary
- Non-overlapping tags: global optimization