Info 629 Presentation


  • 7/31/2019 Info 629 Presentation

    1/34

    Hidden Markov Models

    Applied to Information Extraction

    Part I: Concept

    HMM Tutorial

    Part II: Sample Application

    AutoBib: web information extraction

    Larry Reeve

    INFO629: Artificial Intelligence

    Dr. Weber, Fall 2004


    Part I: Concept

    HMM Motivation

    The real world has structures and processes which have (or produce) observable outputs

    Usually sequential (process unfolds over time)

    Cannot see the event producing the output
    Example: speech signals

    Problem: how to construct a model of the structure or process given only observations


    HMM Background

    Basic theory developed and published in 1960s and 70s

    No widespread understanding and application until late 80s

    Why?

    Theory was published in mathematics journals that were not widely read by practicing engineers

    Insufficient tutorial material for readers to understand and apply concepts


    HMM Uses

    Speech recognition

    Recognizing spoken words and phrases

    Text processing
    Parsing raw records into structured records

    Bioinformatics
    Protein sequence prediction

    Finance
    Stock market forecasts (price pattern prediction)
    Comparison shopping services


    HMM Overview

    Machine learning method

    Makes use of state machines

    Based on probabilistic models

    Useful in problems having sequential steps

    Can only observe output from states, not the states themselves
    Example: speech recognition

    Observe: acoustic signals

    Hidden States: phonemes

    (distinctive sounds of a language)

    State machine:


    Observable Markov Model Example

    Weather: once each day the weather is observed

    State 1: rain State 2: cloudy State 3: sunny

    What is the probability the weather for the next 7 days will be:

    sun, sun, rain, rain, sun, cloudy, sun

    Each state corresponds to a physical observable event

    State transition matrix

    Rainy Cloudy Sunny

    Rainy 0.4 0.3 0.3

    Cloudy 0.2 0.6 0.2

    Sunny 0.1 0.1 0.8
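
Because every state is directly observable here, the probability of the 7-day sequence is just the product of the transition probabilities along it. A minimal sketch in Python, using the transition matrix above and assuming (as in Rabiner's setup) that the chain starts in the first observed state with probability 1:

```python
# Probability of an observed state sequence in the weather Markov chain.
# Transition matrix is taken from the slide; we assume the chain starts in
# the first observed state with probability 1.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(seq):
    """Multiply the transition probabilities along the observed sequence."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

week = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(week))  # 0.8*0.1*0.4*0.3*0.1*0.2 ≈ 1.92e-4
```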


    Observable Markov Model (diagram)


    Hidden Markov Model Example

    Coin toss: Heads, tails sequence with 2 coins

    You are in a room, with a wall

    Person behind wall flips a coin and tells the result
    Coin selection and toss are hidden

    Cannot observe events, only output (heads, tails) from events

    The problem is then to build a model to explain the observed sequence of heads and tails


    HMM Components

    A set of states (xs)

    A set of possible output symbols (ys)

    A state transition matrix (as): probability of making a transition from one state to the next

    An output emission matrix (bs): probability of emitting/observing a symbol at a particular state

    An initial probability vector: probability of starting at a particular state
    Not always shown; sometimes assumed to be 1
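
These components can be written out concretely. A sketch using the two-coin example from the previous slide; the specific probability values are invented for illustration:

```python
# The HMM components, written out for the two-coin example.
# The probability values below are invented for illustration.
states = ["coin1", "coin2"]                    # hidden states (xs)
symbols = ["H", "T"]                           # output symbols (ys)
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},    # state transition matrix (as)
     "coin2": {"coin1": 0.4, "coin2": 0.6}}
B = {"coin1": {"H": 0.9, "T": 0.1},            # emission matrix (bs): coin1 biased
     "coin2": {"H": 0.5, "T": 0.5}}            # coin2 fair
pi = {"coin1": 0.5, "coin2": 0.5}              # initial probability vector

# Sanity check: every probability distribution must sum to 1.
for s in states:
    assert abs(sum(A[s].values()) - 1.0) < 1e-9
    assert abs(sum(B[s].values()) - 1.0) < 1e-9
assert abs(sum(pi.values()) - 1.0) < 1e-9
```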


    HMM Components (diagram)


    Common HMM Types

    Ergodic (fully connected):

    Every state of model can be reached ina single step from every other state of

    the model

    Bakis (left-right):

    As time increases, states proceed from left to right


    HMM Core Problems

    Three problems must be solved for HMMs to be useful in real-world applications

    1) Evaluation

    2) Decoding

    3) Learning


    HMM Evaluation Problem

    Purpose: score how well a given model matches a given observation sequence

    Example (Speech recognition):

    Assume HMMs (models) have been built for the words "home" and "work".

    Given a speech signal, evaluation can determine the probability that each model represents the utterance
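
The standard solution to the evaluation problem is the forward algorithm, which sums over all hidden state paths in O(N²T) time rather than enumerating them. A sketch on the two-coin model; the probability values are invented for illustration:

```python
# Evaluation: P(observation sequence | model) via the forward algorithm.
# Model parameters are invented (two-coin example), not from the slides.
def forward(obs, states, A, B, pi):
    """Sum over all hidden state paths in O(N^2 * T) time."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}        # initialization
    for o in obs[1:]:                                        # induction
        alpha = {s: sum(alpha[p] * A[p][s] for p in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())                               # termination

states = ["coin1", "coin2"]
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},
     "coin2": {"coin1": 0.4, "coin2": 0.6}}
B = {"coin1": {"H": 0.9, "T": 0.1},
     "coin2": {"H": 0.5, "T": 0.5}}
pi = {"coin1": 0.5, "coin2": 0.5}

# Score a heads-heads-tails observation against this model.
print(forward(["H", "H", "T"], states, A, B, pi))  # ≈ 0.13062
```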


    HMM Decoding Problem

    Given a model and a set of observations, what are the hidden states most likely to have generated the observations?

    Useful to learn about internal model structure, determine state statistics, and so forth


    HMM Learning Problem

    Goal is to learn HMM parameters (training)
    State transition probabilities
    Observation probabilities at each state

    Training is crucial: it allows optimal adaptation of model parameters to training data observed from real-world phenomena

    No known method for obtaining optimal parameters from data; only approximations

    Can be a bottleneck in HMM usage


    HMM Concept Summary

    Build models representing the hidden states of a process or structure using only observations

    Use the models to evaluate the probability that a model represents a particular observation sequence

    Use the evaluation information in an application: recognize speech, parse addresses, and many other applications


    Part II: Application

    AutoBib System

    Provide a uniform view of several computer science bibliographic web data sources

    An automated web information extraction system that requires little human input
    Web pages are designed differently from site to site

    IE requires training samples

    HMMs used to parse unstructured bibliographic records into a structured format (an NLP task)


    Web Information Extraction

    Converting Raw Records


    Approach

    1) Provide seed database of structured records

    2) Extract raw records from relevant Web pages

    3) Match structured records to raw records to build training samples

    4) Train HMM-based parser

    5) Parse unmatched raw records into structured records
    6) Merge new structured records into database


    AutoBib Architecture


    Step 1 - Seeding

    Provide seed database of structured records
    Take a small collection of BibTeX-format records and insert them into the database

    Cleaning step normalizes record fields
    Examples:

    Proc. → Proceedings

    Jan → January

    Manual step, executed once only


    Step 2 - Extract Raw Records

    Extract raw records from relevant Web pages

    User specifies

    Web pages to extract from

    How to follow next page links for multiple pages

    Raw records are extracted

    Uses record-boundary discovery techniques
    Subtree of Interest = largest subtree of HTML tags

    Record separators = frequent HTML tags


    Tokenized Records

    (Replace all HTML tags with ^)
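
This tag-replacement step can be sketched with a regular expression; the raw record text below is an invented example, not one of the paper's actual records:

```python
import re

# Tokenization sketch: replace every HTML tag in a raw record with the '^'
# separator. The raw record below is an invented example.
def tokenize(raw_html):
    return re.sub(r"<[^>]+>", "^", raw_html)

raw = "<li><b>J. Geng, J. Yang</b>, Automatic Extraction..., <i>IDEAS</i>, 2004.</li>"
print(tokenize(raw))
# ^^J. Geng, J. Yang^, Automatic Extraction..., ^IDEAS^, 2004.^
```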


    Step 3 - Matching

    Match raw records R to structured records S

    Apply 4 tests (heuristic-based):
    1) Match at least one author in R to an author in S

    2) S.year must appear in R

    3) If S.pages exists, R must contain it

    4) S.title is approximately contained in R

    Levenshtein edit distance provides the approximate string match
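
Test 4 can be sketched with the classic Levenshtein dynamic program: S.title is "approximately contained" in R if some window of R lies within a small edit distance of it. The sliding-window policy and the max_dist threshold are assumptions, since the slides do not give the exact criterion:

```python
# Sketch of test 4 (approximate containment). Window policy and max_dist
# threshold are assumptions; the slides do not specify exact values.
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_contains(record, title, max_dist=2):
    """True if some len(title)-sized window of record is within max_dist edits."""
    n = len(title)
    return any(edit_distance(record[i:i + n], title) <= max_dist
               for i in range(len(record) - n + 1))

print(edit_distance("kitten", "sitting"))  # 3
```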


    Step 4 - Parser Training

    Train HMM-based parser

    For each pair of R and S that match, annotate tokens in the raw record with field names

    Annotated raw records are fed into the HMM parser in order to learn:

    State transition probabilities
    Symbol probabilities at each state
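
With fully annotated training records, both sets of probabilities reduce to normalized counts (maximum-likelihood estimates). A sketch; the tiny labeled record and the exact state names are invented for illustration:

```python
from collections import Counter

# Supervised HMM training sketch: with annotated (token, state) records,
# transition and emission probabilities are just normalized counts.
# The labeled record and state names below are invented for illustration.
def train(labeled_sequences):
    trans, emit = Counter(), Counter()
    for seq in labeled_sequences:
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[(s, s_next)] += 1          # count state-to-state transitions
        for tok, s in seq:
            emit[(s, tok)] += 1              # count symbol emissions per state
    def normalize(counts):                   # counts -> conditional probabilities
        totals = Counter()
        for (s, _), c in counts.items():
            totals[s] += c
        return {k: c / totals[k[0]] for k, c in counts.items()}
    return normalize(trans), normalize(emit)

data = [[("Geng", "author"), (",", "author-delimiter"),
         ("Yang", "author"), ("2004", "year")]]
trans_probs, emit_probs = train(data)
print(trans_probs[("author", "author-delimiter")])  # 0.5
```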


    Parser Training, continued

    Key consideration is the HMM structure for navigating record fields (fields, delimiters)
    Special states:

    start, end

    Normal states: author, title, year, etc.

    Best structure found: have multiple delimiter and tag states, one for each normal state
    Example: author-delimiter, author-tag


    Sample HMM (Method 3)

    Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg



    Step 6 - Merging

    Merge new structured records into database

    Initial seed database has now grown

    New records will be used for improved matching on the next run


    Evaluation

    Success rate = (# of tokens labeled by HMM) / (# of tokens labeled by person)

    DBLP (Computer Science Bibliography): 98.9%

    CSWD (CompuScience WWW-Database): 93.4%


    HMM Advantages / Disadvantages

    Advantages
    Effective
    Can handle variations in record structure:

    Optional fields

    Varying field ordering

    Disadvantages
    Requires training using annotated data

    Not completely automatic
    May require manual markup
    Size of training data may be an issue


    Other methods

    Wrappers
    Specification of areas of interest on a Web page

    Hand-crafted

    Wrapper induction
    Requires manual training

    Does not always accommodate changing structure

    Syntax-based; no semantic labeling


    Application to Other Domains

    E-Commerce

    Comparison shopping sites

    Extract product/pricing information from many sites

    Convert information into structured format and store

    Provide interface to look up product information and then display pricing information gathered from many sites

    Saves users time

    Rather than navigating to and searching many sites, users can consult a single site


    References

    Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.

    Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS '04), 193-204.