7/31/2019 Info 629 Presentation
Hidden Markov Models
Applied to Information Extraction
Part I: Concept
HMM Tutorial
Part II: Sample Application
AutoBib: web information extraction
Larry Reeve
INFO629: Artificial Intelligence
Dr. Weber, Fall 2004
Part I: Concept
HMM Motivation
The real world has structures and processes which have (or produce) observable outputs
Usually sequential (the process unfolds over time)
Cannot see the events producing the outputs
Example: speech signals
Problem: how to construct a model of the structure or process given only observations
HMM Background
Basic theory developed and published in 1960s and 70s
No widespread understanding and application until late 80s
Why?
Theory was published in mathematics journals that were not widely read by practicing engineers
Insufficient tutorial material for readers to understand and apply concepts
HMM Uses
Speech recognition
Recognizing spoken words and phrases
Text processing
Parsing raw records into structured records
Bioinformatics
Protein sequence prediction
Finance
Stock market forecasts (price pattern prediction)
Comparison shopping services
HMM Overview
Machine learning method
Makes use of state machines
Based on probabilistic models
Useful in problems having sequential steps
Can only observe output from states, not the states themselves
Example: speech recognition
Observe: acoustic signals
Hidden states: phonemes (distinctive sounds of a language)
State machine: (diagram not reproduced)
Observable Markov Model Example
Weather: once each day, the weather is observed
State 1: rain; State 2: cloudy; State 3: sunny
What is the probability the weather for the next 7 days will be:
sun, sun, rain, rain, sun, cloudy, sun?
Each state corresponds to a physical, observable event
State transition matrix
Rainy Cloudy Sunny
Rainy 0.4 0.3 0.3
Cloudy 0.2 0.6 0.2
Sunny 0.1 0.1 0.8
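The forecast question above can be answered directly from the transition matrix: given a known starting state, the probability of an observed sequence is just the product of the transition probabilities along it. A minimal sketch (assuming, as in Rabiner's tutorial version of this example, that today's weather is known to be sunny):

```python
# Transition matrix from the slide: rows = today's weather, cols = tomorrow's.
P = {
    "rain":   {"rain": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rain": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rain": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(start, observations):
    """Probability of observing `observations` given the current state `start`."""
    prob, state = 1.0, start
    for nxt in observations:
        prob *= P[state][nxt]   # one transition per observed day
        state = nxt
    return prob

week = ["sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"]
p = sequence_probability("sunny", week)
print(p)  # ~1.536e-04
```

Because every state is directly observable here, no hidden-state machinery is needed; this is what makes the model an *observable* Markov model.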
Observable Markov Model (state diagram not reproduced)
Hidden Markov Model Example
Coin toss: a heads/tails sequence produced with 2 coins
You are in a room, with a wall
A person behind the wall flips a coin and tells you the result
The coin selection and toss are hidden
You cannot observe the events, only the output (heads, tails) from the events
Problem is then to build a model to explain the observed sequence of heads and tails
HMM Components
A set of states (x's)
A set of possible output symbols (y's)
A state transition matrix (a's): probability of making a transition from one state to the next
An output emission matrix (b's): probability of emitting/observing a symbol at a particular state
An initial probability vector: probability of starting at a particular state (not shown; sometimes assumed to be 1)
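These components can be written down concretely. A sketch using NumPy, with a toy two-state model in the spirit of the coin example; the coin names and all probability values here are illustrative assumptions, not from the slides:

```python
import numpy as np

states  = ["coin1", "coin2"]        # hidden states (x's)
symbols = ["H", "T"]                # possible output symbols (y's)

A  = np.array([[0.6, 0.4],          # transition matrix (a's): A[i, j] =
               [0.3, 0.7]])         # P(next state j | current state i)
B  = np.array([[0.5, 0.5],          # emission matrix (b's): B[i, k] =
               [0.8, 0.2]])         # P(symbol k | state i)
pi = np.array([0.5, 0.5])           # initial probability vector

# Sanity check: every probability row must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```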
HMM Components (diagram not reproduced)
Common HMM Types
Ergodic (fully connected):
Every state of the model can be reached in a single step from every other state of the model
Bakis (left-right):
As time increases, states proceed from left to right
HMM Core Problems
Three problems must be solved for HMMs to be useful in real-world applications
1) Evaluation
2) Decoding
3) Learning
HMM Evaluation Problem
Purpose: score how well a given model matches a given observation sequence
Example (speech recognition):
Assume HMMs (models) have been built for the words home and work.
Given a speech signal, evaluation can determine the probability that each model represents the utterance
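The standard solution to the evaluation problem is the forward algorithm. A minimal sketch, assuming NumPy arrays for the model and observation symbols coded as integer column indices; the toy model values at the bottom are invented for illustration:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: total probability of `obs` under the model.
    A: (N,N) state transitions, B: (N,M) emissions, pi: (N,) initial
    probabilities, obs: observation symbols as column indices into B."""
    alpha = pi * B[:, obs[0]]             # state probabilities after first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step, weight by emission
    return float(alpha.sum())

# Toy model; in the speech example each word (home, work) would get its own
# (A, B, pi) and the model with the highest forward score wins.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward(A, B, pi, [0, 1, 0]))  # 0.099375
```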
HMM Decoding Problem
Given a model and a set of observations, what are the hidden states most likely to have generated the observations?
Useful to learn about internal model structure, determine state statistics, and so forth
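Decoding is classically solved with the Viterbi algorithm, which tracks the single best state path instead of summing over all paths. A sketch assuming NumPy model arrays and integer-coded observations; the toy model values are invented for illustration:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden-state sequence for `obs` (Viterbi decoding)."""
    T, N = len(obs), A.shape[0]
    delta = pi * B[:, obs[0]]               # best-path probability per state
    back = np.zeros((T, N), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] * A         # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]            # best final state, then walk back
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 0]))  # [0, 1, 0]
```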
HMM Learning Problem
Goal is to learn the HMM parameters (training):
State transition probabilities
Observation probabilities at each state
Training is crucial: it adapts the model parameters to the observed training data, i.e., to the real-world phenomenon being modeled
No known method obtains globally optimal parameters from data; only approximations are available
Can be a bottleneck in HMM usage
HMM Concept Summary
Build models representing the hidden states of a process or structure using only observations
Use the models to evaluate the probability that a model represents a particular observation sequence
Use the evaluation information in an application to recognize speech, parse addresses, and much more
Part II: Application
AutoBib System
Provides a uniform view of several computer science bibliographic web data sources
An automated web information extraction system that requires little human input
Web pages are designed differently from site to site
IE requires training samples
HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)
Web Information Extraction
Converting Raw Records
Approach
1) Provide a seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records (to build training samples)
4) Train the HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge new structured records into the database
AutoBib Architecture (diagram not reproduced)
Step 1 - Seeding
Provide a seed database of structured records
Take a small collection of BibTeX-format records and insert them into the database
A cleaning step normalizes record fields. Examples:
Proc. → Proceedings
Jan → January
Manual step, executed once only
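The normalization step can be sketched as a table of abbreviation rewrites; the abbreviation table and field value below are illustrative, not AutoBib's actual list:

```python
import re

# Map abbreviation patterns to their expansions (word boundaries avoid
# rewriting substrings inside longer words).
ABBREVIATIONS = {r"\bProc\.": "Proceedings", r"\bJan\b\.?": "January"}

def normalize_field(value: str) -> str:
    """Expand known abbreviations in a record field."""
    for pattern, full in ABBREVIATIONS.items():
        value = re.sub(pattern, full, value)
    return value.strip()

print(normalize_field("Proc. 8th IDEAS, Jan 2004"))
# Proceedings 8th IDEAS, January 2004
```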
Step 2 - Extract Raw Records
Extract raw records from relevant Web pages
User specifies
Web pages to extract from
How to follow next page links for multiple pages
Raw records are extracted
Uses record-boundary discovery techniques
Subtree of Interest = largest subtree of HTML tags
Record separators = frequent HTML tags
Tokenized Records
(Replace all HTML tags with ^)
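That replacement can be sketched with a single regular expression; the raw record string below is an invented example, not one of AutoBib's:

```python
import re

# Tokenization step: collapse every HTML tag into the marker character '^'
# so the parser sees field text separated by uniform delimiters.
raw = '<li><b>Geng, J.</b> <i>Automatic Extraction</i> 2004</li>'
tokenized = re.sub(r"<[^>]+>", "^", raw)
print(tokenized)  # ^^Geng, J.^ ^Automatic Extraction^ 2004^
```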
Step 3 - Matching
Match raw records R to structured records S
Apply 4 tests (heuristic-based):
1) Match at least one author in R to an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is approximately contained in R
(Levenshtein edit distance gives the approximate string match)
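The four tests above can be sketched as follows; the record shapes (a dict for S, a string for R) and the edit-distance threshold are assumptions for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit-distance dynamic program (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def approx_contains(needle: str, haystack: str, max_dist: int = 3) -> bool:
    """Approximate containment: slide a needle-sized window over the haystack."""
    n = len(needle)
    return any(levenshtein(needle, haystack[i:i + n]) <= max_dist
               for i in range(max(1, len(haystack) - n + 1)))

def matches(raw: str, s: dict) -> bool:
    """The four heuristic tests from the slide."""
    return (any(a in raw for a in s["authors"])             # 1) some author in R
            and s["year"] in raw                            # 2) year in R
            and (not s.get("pages") or s["pages"] in raw)   # 3) pages, if present
            and approx_contains(s["title"], raw))           # 4) title approx. in R

s = {"authors": ["Geng"], "year": "2004", "pages": "",
     "title": "Automatic Extraction"}
print(matches("Geng, J. Automatic Extraction of Bibliographic Info 2004", s))
# True
```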
Step 4 - Parser Training
Train the HMM-based parser
For each pair of R and S that match, annotate the tokens in the raw record with field names
Annotated raw records are fed into the HMM parser in order to learn:
State transition probabilities
Symbol probabilities at each state
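Because the training records are annotated, both probability tables can be estimated simply by counting. A sketch with two toy annotated records; the state labels and tokens are invented for illustration:

```python
from collections import Counter, defaultdict

# Each annotated record is a sequence of (token, state-label) pairs.
annotated = [
    [("Reeve", "author"), (",", "delim"), ("2004", "year")],
    [("Geng", "author"), (",", "delim"), ("2004", "year")],
]

trans, emit = defaultdict(Counter), defaultdict(Counter)
for record in annotated:
    for (_, state), (_, nxt) in zip(record, record[1:]):
        trans[state][nxt] += 1        # count state-to-state transitions
    for tok, state in record:
        emit[state][tok] += 1         # count symbol emissions per state

def prob(counter, key):
    """Relative frequency: count of `key` over all counts in `counter`."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

print(prob(trans["author"], "delim"))  # 1.0 in this toy data
print(prob(emit["author"], "Reeve"))   # 0.5
```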
Parser Training, continued
Key consideration is the HMM structure for navigating record fields (fields, delimiters)
Special states: start, end
Normal states: author, title, year, etc.
Best structure found: have multiple delimiter and tag states, one for each normal state
Example: author-delimiter, author-tag
Sample HMM (Method 3)
Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
Step 6 - Merging
Merge new structured records into database
Initial seed database has now grown
New records will be used for improved matching on the next run
Evaluation
Success rate = (# of tokens labeled by HMM) / (# of tokens labeled by a person)
DBLP (Computer Science Bibliography): 98.9%
CSWD (CompuScience WWW-Database): 93.4%
HMM Advantages / Disadvantages
Advantages
Effective
Can handle variations in record structure
Optional fields
Varying field ordering
Disadvantages
Requires training using annotated data
Not completely automatic
May require manual markup
Size of training data may be an issue
Other methods
Wrappers
Specification of areas of interest on a Web page
Hand-crafted
Wrapper induction
Requires manual training
Not always accommodating to changing structure
Syntax-based; no semantic labeling
Application to Other Domains
E-Commerce
Comparison shopping sites
Extract product/pricing information from many sites
Convert information into structured format and store
Provide an interface to look up product information and then display pricing information gathered from many sites
Saves users time
Rather than navigating to and searching many sites, users can consult a single site
References
Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.