Hidden Markov Models for Information Extraction
Recent Results and Current Projects
Joseph Smarr & Huy Nguyen
Advisor: Chris Manning
HMM Approach to IE
HMM states are associated with a semantic type: background-text, person-name, etc.
Constrained EM learns transitions and emissions
Viterbi alignment of a document marks tagged ranges of text with the same semantic type
Extract the range with the highest probability
[Figure: Viterbi alignment — state sequence 2 3 4 5 6 2 over the words "Speaker is Huy Nguyen this week"]
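To make the decoding step concrete, here is a minimal Viterbi sketch in Python; the states, transition probabilities, and emission scores are invented for illustration (the real models are learned with constrained EM as above). Contiguous words assigned a target state are read off as the extracted field.

```python
import numpy as np

# Hypothetical toy model: state 0 = background-text, state 1 = speaker-name.
states = ["background-text", "speaker-name"]
start = np.log([0.9, 0.1])
trans = np.log([[0.8, 0.2],   # background -> background / name
                [0.4, 0.6]])  # name -> background / name

def viterbi(emit_logprobs, start, trans):
    """emit_logprobs: (T, S) array of log P(word_t | state_s).
    Returns the most likely state sequence."""
    T, S = emit_logprobs.shape
    delta = start + emit_logprobs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans          # (prev, next)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emit_logprobs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Invented emission scores for "Speaker is Huy Nguyen this week":
# capitalized mid-sentence words score higher under speaker-name.
words = "Speaker is Huy Nguyen this week".split()
emit = np.log([[0.6, 0.4], [0.9, 0.1], [0.2, 0.8],
               [0.2, 0.8], [0.9, 0.1], [0.9, 0.1]])
path = viterbi(emit, start, trans)
# Extract the contiguous run labeled speaker-name.
print([w for w, s in zip(words, path) if states[s] == "speaker-name"])
# expected: ['Huy', 'Nguyen']
```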
Existing Work
Leek (1997, UCSD MS thesis): early results, fixed structures
Freitag & McCallum (1999, 2000): grow complex structures
Limitations of Existing Work
Only one field extracted at a time
  Relative position of fields is ignored (e.g. authors usually come before titles in citations)
  Similar-looking fields aren't competed for (e.g. acquired company vs. purchasing company)
Simple model of unknown words: use <UNK> for all words seen fewer than N times
No separation of content and context (e.g. can't plug in generic date extractors, etc.)
Current Research Goals
Flexibly train and combine extractors for multiple fields of information
Learn structures suited to individual fields
  Can be recombined and reused with many HMMs
Learn intelligent context structures to link targets
  Canonical ordering of fields
  Common prefixes and suffixes
Construct merged HMM for actual extraction
  Context/target split makes the search problem tractable
  Transitions between models are compiled out in the merge
Current Research Goals
Richer models for handling unknown words
  Estimate likelihood of novel words in each state
  Featural decomposition for finer-grained probabilities (e.g. Nguyen → UNK[Capitalized, No-numbers]; sketched below)
  Character-level models for higher precision (e.g. phone numbers, room numbers, dates, etc.)
Conditional training to focus on the extraction task
  Classical joint estimation often wastes states modeling patterns in English background text
  Conditional training is slower, but only rewards structure that increases labeling accuracy
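As one illustration of featural decomposition, here is a sketch with an invented two-feature inventory and count threshold: rare words map to feature-bundle tokens rather than a single catch-all <UNK>.

```python
import re
from collections import Counter

def unk_features(word):
    """Map a rare word to a featural UNK token, e.g.
    'Nguyen' -> 'UNK[Capitalized,No-numbers]'. Feature set is illustrative."""
    feats = ["Capitalized" if word[:1].isupper() else "Not-capitalized",
             "Has-numbers" if re.search(r"\d", word) else "No-numbers"]
    return "UNK[%s]" % ",".join(feats)

def apply_unks(docs, min_count=3):
    """Replace words seen fewer than min_count times with featural UNKs,
    so states can still assign finer-grained probabilities to novel words."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w if counts[w] >= min_count else unk_features(w) for w in doc]
            for doc in docs]

print(unk_features("Nguyen"))    # UNK[Capitalized,No-numbers]
print(unk_features("650-723"))   # UNK[Not-capitalized,Has-numbers]
```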
Learning Target Structures
Goal: Learn a flexible structure tailored to the composition of particular fields
Representation: Disjunction of multi-state chains
Learning method:
  Collect and isolate all examples of the target field
  Initialization: single state
  Search operators (greedy search): extend current chain(s), start a new chain (toy sketch below)
  Stopping criterion: MDL score
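A toy, runnable sketch of this greedy loop. The representation is a deliberate simplification: chains here emit fixed-length word sequences with smoothed per-position unigrams, and the MDL constants are invented, so this only illustrates the search shape, not the project's actual models.

```python
import math
from collections import Counter

def fit_chain(length, examples):
    """Per-position emission counts from all target examples of this length."""
    chain = [Counter() for _ in range(length)]
    for ex in examples:
        if len(ex) == length:
            for counts, w in zip(chain, ex):
                counts[w] += 1
    return chain

def log_lik(structure, examples, alpha=0.5, penalty=-50.0):
    """Each example scored by its best length-matching chain with add-alpha
    smoothing; examples no chain can generate pay a fixed penalty."""
    total = 0.0
    for ex in examples:
        best = penalty
        for chain in structure:
            if len(chain) == len(ex):
                lp = sum(math.log((c[w] + alpha) / (sum(c.values()) + alpha * (len(c) + 1)))
                         for c, w in zip(chain, ex))
                best = max(best, lp)
        total += best
    return total

def mdl(structure, examples):
    """MDL score: 0.5 * #params * log N (model cost) minus log-likelihood."""
    n_params = len(structure) + sum(len(c) for chain in structure for c in chain)
    return 0.5 * n_params * math.log(len(examples)) - log_lik(structure, examples)

def learn_target_structure(examples):
    """Greedy search: init with one single-state chain; operators extend a
    chain by one state or start a new chain; stop when MDL stops improving."""
    structure = [fit_chain(1, examples)]
    score = mdl(structure, examples)
    while True:
        cands = [structure[:i] + [fit_chain(len(ch) + 1, examples)] + structure[i+1:]
                 for i, ch in enumerate(structure)]           # extend chain i
        seen = {len(ch) for ch in structure}
        cands += [structure + [fit_chain(len(ex), examples)]  # start a new chain
                  for ex in examples if len(ex) not in seen]
        best = min(cands, key=lambda s: mdl(s, examples))
        if mdl(best, examples) >= score:
            return structure
        structure, score = best, mdl(best, examples)

# Toy "dlramt"-style field instances:
examples = [["13.5", "mln", "dlrs"], ["240", "mln", "dlrs"],
            ["100", "billion", "yen"], ["undisclosed"]]
print([len(chain) for chain in learn_target_structure(examples)])  # e.g. [1, 3]
```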
Example Target HMM: dlramt
[Figure: learned target HMM for the dlramt field — parallel chains from START to END: amounts (13.5, 240, 100) followed by units (mln, billion, U.S., Canadian) and currency words (dlrs, dollars, yen, pesos), plus a chain for "undisclosed/withheld amount"]
Learning Context Structures
Goal: Learn a structure to connect multiple target HMMs
  Captures canonical ordering of fields
  Identifies prefix and suffix patterns around targets
Initialization (sketched below):
  Background state connected to each target
  Find the minimum # of words between each target type in the corpus
  Connect targets directly if the distance is 0; add a context state between targets if they're close
Search operators (greedy search): add a prefix/suffix between background and target, lengthen an existing chain, start a new chain (by splitting an existing one)
Stopping criterion: MDL score
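A small sketch of the distance-based initialization rule; the (word, label) encoding and the "close" threshold of 3 words are assumptions for illustration.

```python
from itertools import groupby

def target_spans(doc):
    """doc: list of (word, label) pairs, label None for background text.
    Returns (label, start, end) for each contiguous labeled span."""
    spans, i = [], 0
    for label, group in groupby(doc, key=lambda pair: pair[1]):
        n = sum(1 for _ in group)
        if label is not None:
            spans.append((label, i, i + n))
        i += n
    return spans

def min_gaps(corpus):
    """Minimum number of words seen between consecutive target types."""
    gaps = {}
    for doc in corpus:
        spans = target_spans(doc)
        for (a, _, a_end), (b, b_start, _) in zip(spans, spans[1:]):
            gaps[(a, b)] = min(gaps.get((a, b), b_start - a_end), b_start - a_end)
    return gaps

def init_context_edges(gaps, close=3):
    """Slide's rule: direct edge if the minimum gap is 0, a context state
    if the targets are close, otherwise route through background."""
    return [(a, b, "direct" if g == 0 else "context" if g <= close else "background")
            for (a, b), g in gaps.items()]

doc = [("Cray", "purchaser"), ("yesterday", None), ("bought", None),
       ("Floating", "acquired"), ("Point", "acquired")]
print(init_context_edges(min_gaps([doc])))  # [('purchaser', 'acquired', 'context')]
```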
Example of Context HMM
[Figure: context HMM — a Background state (emitting e.g. The, yesterday, Reuters) links START to the Purchaser and Acquired targets, with a Context state between them emitting purchased/acquired/bought, then to END]
Merging Context and Targets
In the context HMM, targets are collapsed into a single state that always emits "purchaser", etc.
Target HMMs have a single START and END state
Glue target HMMs into place by "compiling out" the start/end transitions, creating one big HMM (see the sketch below)
Challenge: create supportive structure without being overly restrictive
  Too little structure: hard to find regularities
  Too much structure: can't generate all docs
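Here is one way the compile-out step can look on a toy dictionary encoding of transition probabilities; the encoding and state names are invented, so this is a sketch of the splice, not the project's actual code.

```python
def compile_in(context, placeholder, target, prefix):
    """Splice a target HMM into a context HMM by compiling out the target's
    START/END: edges into the placeholder state are redirected onto the
    target's initial states, and transitions into the target's END inherit
    the placeholder's outgoing edges. HMMs are {state: {next: prob}} dicts."""
    ren = lambda s: prefix + s  # keep merged state names disjoint
    merged = {s: dict(outs) for s, outs in context.items() if s != placeholder}
    for s, outs in target.items():
        if s == "START":
            continue
        merged[ren(s)] = {}
        for nxt, p in outs.items():
            if nxt == "END":  # inherit the placeholder's outgoing edges
                for cnxt, q in context[placeholder].items():
                    merged[ren(s)][cnxt] = merged[ren(s)].get(cnxt, 0.0) + p * q
            else:
                merged[ren(s)][ren(nxt)] = p
    for outs in merged.values():  # redirect placeholder edges to target start
        if placeholder in outs:
            p = outs.pop(placeholder)
            for t0, q in target["START"].items():
                outs[ren(t0)] = outs.get(ren(t0), 0.0) + p * q
    return merged

# Toy example: splice a one-state name model in place of "Purchaser".
context = {"START": {"Background": 1.0},
           "Background": {"Purchaser": 0.3, "Background": 0.6, "END": 0.1},
           "Purchaser": {"Background": 0.8, "END": 0.2}}
target = {"START": {"Name": 1.0}, "Name": {"Name": 0.5, "END": 0.5}}
merged = compile_in(context, "Purchaser", target, prefix="purch/")
print(merged["Background"])  # {'Background': 0.6, 'END': 0.1, 'purch/Name': 0.3}
print(merged["purch/Name"])  # {'purch/Name': 0.5, 'Background': 0.4, 'END': 0.1}
```

Note how outgoing probability mass is preserved: each compiled-out transition multiplies through the spliced model, so every state's edges still sum to one.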
Example of Merging HMMs
[Figure: the Purchaser target HMM (its own START/END chain) is spliced into the context HMM between Background and the Context/Acquired states, replacing the single Purchaser placeholder state]
Tricks and Optimizations
Mandatory end state: allows explicit modeling of the document end
Structural enhancements (sketched below):
  Add transitions from start directly to targets
  Add transitions from target/suffix directly to end
  Allow "skip-ahead" transitions
Separation from core structure learning:
  Structure learning is performed on the "skeleton" structure
  Enhancements are added during parameter estimation
  Keeps search tractable while exploiting rich transitions
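A sketch of the enhancement pass on the same toy dictionary encoding used above. The eps seed mass is an assumption (real values come from parameter estimation), and "skip-ahead" is interpreted here as a two-step shortcut, which may differ from the project's exact definition.

```python
def enhance(hmm, targets, eps=0.01):
    """Add enhancement edges to a skeleton HMM ({state: {next: prob}}):
    start -> each target, each target -> end, and two-step "skip-ahead"
    edges; then renormalize each state's outgoing distribution."""
    for t in targets:
        hmm["START"].setdefault(t, eps)   # start directly to target
        hmm[t].setdefault("END", eps)     # target directly to end
    for s in list(hmm):                   # skip-ahead: s -> u -> v adds s -> v
        for u in list(hmm[s]):
            for v in list(hmm.get(u, {})):
                if v != s:
                    hmm[s].setdefault(v, eps)
    for outs in hmm.values():             # renormalize outgoing edges
        z = sum(outs.values())
        for k in outs:
            outs[k] /= z
    return hmm

skeleton = {"START": {"prefix": 1.0}, "prefix": {"target": 1.0},
            "target": {"suffix": 1.0}, "suffix": {"END": 1.0}}
enhance(skeleton, targets=["target"])
print(skeleton["START"])  # now includes a direct 'target' edge and skip-aheads
```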
Sample of Recent F1 Results
[Chart: average F1 over 10 folds (y-axis 40%–65%) on the purchaser and dlramt fields and their average, comparing the FrMcC, Jim, Chris2, S-Merged, and Merged models]
Unknown Word Results
[Chart: F1 (y-axis 0%–80%) on purchaser, dlramt, and average, comparing the Single UNK, Held Out, and Decomp unknown-word models]
Conditional Training
Observation: Joint HMMs waste states modeling patterns in background text
  Improves document likelihood (like n-grams)
  Doesn't improve labeling accuracy (can hurt it!)
  Ideally, focus only on prefixes, suffixes, etc.
Idea: Maximize the conditional probability of labels, P(labels | words), instead of P(labels, words)
  Should only reward modeling helpful patterns
  Can't use standard Baum-Welch training
  Solution: use numerical optimization (conjugate gradient)
Potential of Conditional Training
Don’t waste states modeling background patterns
Toy data model: ((abc)*(eTo))* [T is the target symbol]
  e.g. abcabcabcabceToabcabceToabcabcabc (generator sketched below)
  Modeling abc improves joint likelihood but provides no help for labeling targets
[Figure: optimal joint model (five states emitting a|o, b, c|e, T, o) vs. optimal labeling model (three states emitting a|b|c, e, T)]
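For concreteness, a few lines that generate this toy data and its gold labels (block count and target rate are invented parameters, and like the slide's own example, trailing abc blocks are allowed):

```python
import random

def gen_toy(n_blocks=12, p_target=0.25, seed=1):
    """Sample blocks from the toy pattern: background 'abc' blocks
    interleaved with 'eTo' target blocks; 'T' is the symbol to label."""
    rng = random.Random(seed)
    return "".join("eTo" if rng.random() < p_target else "abc"
                   for _ in range(n_blocks))

s = gen_toy()
labels = ["target" if ch == "T" else "background" for ch in s]
print(s)
```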
Running Conditional Training
Gradient descent requires a differentiable objective function
Value (computed with the forward algorithm):
  $\log P(c \mid w) = \log \sum_{t \in T_c} P(c, w, t \mid \lambda) - \log \sum_{t \in T_u} P(c, w, t \mid \lambda)$
Derivative (computed from parameter expectations):
  $\frac{\partial \log P(c \mid w)}{\partial \lambda_{ij}} = E_c[n_{ij} \mid c, w] - E_u[n_{ij} \mid w]$
where $T_c$ is the set of state paths obeying the type constraints and $T_u$ is the set of all paths
Likelihood and expectations are easily computed with existing HMM algorithms: compute each value with and without the type constraints
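A runnable sketch of the value computation on a toy model (the probabilities are invented; scipy's logsumexp is used for numerical stability): log P(c | w) falls out of two forward passes, one masked to the label-consistent states (T_c) and one unconstrained (T_u).

```python
import numpy as np
from scipy.special import logsumexp

def forward_logZ(emit, start, trans, allowed=None):
    """Forward algorithm in log space. emit: (T, S) log P(w_t | state).
    allowed: optional (T, S) boolean mask of states consistent with the
    labels (the type constraints); None sums over all paths."""
    T, S = emit.shape
    mask = np.zeros((T, S)) if allowed is None else np.where(allowed, 0.0, -np.inf)
    alpha = start + emit[0] + mask[0]
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t] + mask[t]
    return logsumexp(alpha)

def cond_loglik(emit, start, trans, allowed):
    """log P(c | w): label-constrained forward pass minus unconstrained pass."""
    return forward_logZ(emit, start, trans, allowed) - forward_logZ(emit, start, trans)

# Toy check: 2 states (0 = background, 1 = target), gold labels b, t, b.
start = np.log([0.5, 0.5])
trans = np.log([[0.7, 0.3], [0.3, 0.7]])
emit = np.log([[0.6, 0.4], [0.2, 0.8], [0.6, 0.4]])
allowed = np.array([[True, False], [False, True], [True, False]])
print(cond_loglik(emit, start, trans, allowed))  # a log-probability <= 0
```

The derivative is likewise the difference of expected transition counts from the constrained and unconstrained forward–backward passes, which can feed a conjugate-gradient optimizer such as scipy.optimize.minimize(method='CG').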
Challenges for Cond. Training
Need additional constraint to keep numbers small Can’t guarantee you’ll get a probability distribution But it’s ok if you’re just summing and multiplying! Solution: sum of all params must equal a constant
Need to fix parameter space ahead of time Can’t add states, new words, etc. Solution: start with large ergodic model in which all
states emit entire vocabulary (use UNK tokens) Need sensible initialization
Uniform structure has high variance Fixed structure usually dictates training
Results on Toy Data Set
Results on (([ae][bt][co])*(eto))*
  Contains spurious prefix/target/suffix-like symbols
  Joint training always labels every t
  Conditional training eventually gets it perfectly
Current and Future Work
Richer search operators for structure learning
Richer models of unknown words (character-level)
Reduce variance of conditional training
Build a reusable repository of target HMMs
Integrate with larger IE framework(s): Semantic Web / KAON, LTG
Applications: semi-automatic ontology markup for web pages, smart email processing