Info 629 Presentation


  • 7/31/2019 Info 629 Presentation

    1/34

    Hidden Markov Models

    Applied to Information Extraction

    Part I: Concept

    HMM Tutorial

    Part II: Sample Application

    AutoBib: web information extraction

    Larry Reeve

    INFO629: Artificial Intelligence

    Dr. Weber, Fall 2004


    Part I: Concept

    HMM Motivation

    The real world has structures and processes which have (or produce) observable outputs

    Usually sequential (process unfolds over time)

    Cannot see the event producing the output
    Example: speech signals

    Problem: how to construct a model of the structure or process given only observations


    HMM Background

    Basic theory developed and published in 1960s and 70s

    No widespread understanding and application until late 80s

    Why?

    Theory was published in mathematics journals that were not widely read by practicing engineers

    Insufficient tutorial material for readers to understand and apply concepts


    HMM Uses

    Speech recognition

    Recognizing spoken words and phrases

    Text processing
    Parsing raw records into structured records

    Bioinformatics
    Protein sequence prediction

    Finance
    Stock market forecasts (price pattern prediction)
    Comparison shopping services


    HMM Overview

    Machine learning method

    Makes use of state machines

    Based on probabilistic models

    Useful in problems having sequential steps

    Can only observe output from states, not the states themselves
    Example: speech recognition

    Observe: acoustic signals

    Hidden States: phonemes

    (distinctive sounds of a language)

    State machine:


    Observable Markov Model Example

    Weather: once each day the weather is observed

    State 1: rain State 2: cloudy State 3: sunny

    What is the probability the weather for the next 7 days will be:

    sun, sun, rain, rain, sun, cloudy, sun

    Each state corresponds to a physical observable event

    State transition matrix

    Rainy Cloudy Sunny

    Rainy 0.4 0.3 0.3

    Cloudy 0.2 0.6 0.2

    Sunny 0.1 0.1 0.8
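
Because every state is directly observable here, the probability of the 7-day sequence is just the product of the transition probabilities along it. A minimal sketch in Python, using the transition matrix above and assuming (as in Rabiner's setup) that the chain starts in the first observed state with probability 1:

```python
# Probability of an observed state sequence in the weather Markov chain.
# Transition matrix is taken from the slide; we assume the chain starts in
# the first observed state with probability 1.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(seq):
    """Multiply the transition probabilities along the observed sequence."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

week = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(week))  # 0.8*0.1*0.4*0.3*0.1*0.2 ≈ 1.92e-4
```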


    Observable Markov Model (diagram)


    Hidden Markov Model Example

    Coin toss: Heads, tails sequence with 2 coins

    You are in a room, with a wall

    Person behind wall flips a coin and tells the result
    Coin selection and toss are hidden

    Cannot observe events, only output (heads, tails) from events

    The problem is then to build a model to explain the observed sequence of heads and tails


    HMM Components

    A set of states (xs)

    A set of possible output symbols (ys)

    A state transition matrix (as): probability of making a transition from one state to the next

    An output emission matrix (bs): probability of emitting/observing a symbol at a particular state

    An initial probability vector: probability of starting at a particular state
    Not always shown; sometimes assumed to be 1
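
These components can be written out concretely. A sketch using the two-coin example from the previous slide; the specific probability values are invented for illustration:

```python
# The HMM components, written out for the two-coin example.
# The probability values below are invented for illustration.
states = ["coin1", "coin2"]                    # hidden states (xs)
symbols = ["H", "T"]                           # output symbols (ys)
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},    # state transition matrix (as)
     "coin2": {"coin1": 0.4, "coin2": 0.6}}
B = {"coin1": {"H": 0.9, "T": 0.1},            # emission matrix (bs): coin1 biased
     "coin2": {"H": 0.5, "T": 0.5}}            # coin2 fair
pi = {"coin1": 0.5, "coin2": 0.5}              # initial probability vector

# Sanity check: every probability distribution must sum to 1.
for s in states:
    assert abs(sum(A[s].values()) - 1.0) < 1e-9
    assert abs(sum(B[s].values()) - 1.0) < 1e-9
assert abs(sum(pi.values()) - 1.0) < 1e-9
```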


    HMM Components (diagram)


    Common HMM Types

    Ergodic (fully connected):

    Every state of model can be reached ina single step from every other state of

    the model

    Bakis (left-right):

    As time increases, states proceed from left to right


    HMM Core Problems

    Three problems must be solved for HMMs to be useful in real-world applications

    1) Evaluation

    2) Decoding

    3) Learning


    HMM Evaluation Problem

    Purpose: score how well a given model matches a given observation sequence

    Example (Speech recognition):

    Assume HMMs (models) have been built for the words "home" and "work".

    Given a speech signal, evaluation can determine the probability that each model represents the utterance
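
The standard solution to the evaluation problem is the forward algorithm, which sums over all hidden state paths in O(N²T) time rather than enumerating them. A sketch on the two-coin model; the probability values are invented for illustration:

```python
# Evaluation: P(observation sequence | model) via the forward algorithm.
# Model parameters are invented (two-coin example), not from the slides.
def forward(obs, states, A, B, pi):
    """Sum over all hidden state paths in O(N^2 * T) time."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}        # initialization
    for o in obs[1:]:                                        # induction
        alpha = {s: sum(alpha[p] * A[p][s] for p in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())                               # termination

states = ["coin1", "coin2"]
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},
     "coin2": {"coin1": 0.4, "coin2": 0.6}}
B = {"coin1": {"H": 0.9, "T": 0.1},
     "coin2": {"H": 0.5, "T": 0.5}}
pi = {"coin1": 0.5, "coin2": 0.5}

# Score a heads-heads-tails observation against this model.
print(forward(["H", "H", "T"], states, A, B, pi))  # ≈ 0.13062
```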


    HMM Decoding Problem

    Given a model and a set of observations, what are the hidden states most likely to have generated the observations?

    Useful to learn about internal model structure, determine state statistics, and so forth


    HMM Learning Problem

    Goal is to learn HMM parameters (training)
    State transition probabilities
    Observation probabilities at each state

    Training is crucial: it allows optimal adaptation of model parameters to training data observed from real-world phenomena

    No known method for obtaining optimal parameters from data; only approximations

    Can be a bottleneck in HMM usage


    HMM Concept Summary

    Build models representing the hidden states of a process or structure using only observations

    Use the models to evaluate the probability that a model represents a particular observation sequence

    Use the evaluation information in an application: recognize speech, parse addresses, and many other applications


    Part II: Application

    AutoBib System

    Provide a uniform view of several computer science bibliographic web data sources

    An automated web information extraction system that requires little human input
    Web pages are designed differently from site to site

    IE requires training samples

    HMMs used to parse unstructured bibliographic records into a structured format (an NLP task)


    Web Information Extraction

    Converting Raw Records


    Approach

    1) Provide seed database of structured records

    2) Extract raw records from relevant Web pages

    3) Match structured records to raw records to build training samples

    4) Train HMM-based parser

    5) Parse unmatched raw records into structured records
    6) Merge new structured records into database


    AutoBib Architecture


    Step 1 - Seeding

    Provide seed database of structured records
    Take a small collection of BibTeX-format records and insert them into the database

    Cleaning step normalizes record fields
    Examples:

    Proc. → Proceedings

    Jan → January

    Manual step, executed once only


    Step 2 - Extract Raw Records

    Extract raw records from relevant Web pages

    User specifies

    Web pages to extract from

    How to follow next page links for multiple pages

    Raw records are extracted

    Uses record-boundary discovery techniques
    Subtree of Interest = largest subtree of HTML tags

    Record separators = frequent HTML tags


    Tokenized Records

    (Replace all HTML tags with ^)
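
This tag-replacement step can be sketched with a regular expression; the raw record text below is an invented example, not one of the paper's actual records:

```python
import re

# Tokenization sketch: replace every HTML tag in a raw record with the '^'
# separator. The raw record below is an invented example.
def tokenize(raw_html):
    return re.sub(r"<[^>]+>", "^", raw_html)

raw = "<li><b>J. Geng, J. Yang</b>, Automatic Extraction..., <i>IDEAS</i>, 2004.</li>"
print(tokenize(raw))
# ^^J. Geng, J. Yang^, Automatic Extraction..., ^IDEAS^, 2004.^
```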


    Step 3 - Matching

    Match raw records R to structured records S

    Apply 4 tests (heuristic-based):
    1) Match at least one author in R to an author in S

    2) S.year must appear in R

    3) If S.pages exists, R must contain it

    4) S.title is approximately contained in R

    Levenshtein edit distance provides the approximate string match
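
Test 4 can be sketched with the classic Levenshtein dynamic program: S.title is "approximately contained" in R if some window of R lies within a small edit distance of it. The sliding-window policy and the max_dist threshold are assumptions, since the slides do not give the exact criterion:

```python
# Sketch of test 4 (approximate containment). Window policy and max_dist
# threshold are assumptions; the slides do not specify exact values.
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_contains(record, title, max_dist=2):
    """True if some len(title)-sized window of record is within max_dist edits."""
    n = len(title)
    return any(edit_distance(record[i:i + n], title) <= max_dist
               for i in range(len(record) - n + 1))

print(edit_distance("kitten", "sitting"))  # 3
```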


    Step 4 - Parser Training

    Train HMM-based parser

    For each pair of R and S that match, annotate tokens in the raw record with field names

    Annotated raw records are fed into the HMM parser in order to learn:

    State transition probabilities
    Symbol probabilities at each state
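
With fully annotated training records, both sets of probabilities reduce to normalized counts (maximum-likelihood estimates). A sketch; the tiny labeled record and the exact state names are invented for illustration:

```python
from collections import Counter

# Supervised HMM training sketch: with annotated (token, state) records,
# transition and emission probabilities are just normalized counts.
# The labeled record and state names below are invented for illustration.
def train(labeled_sequences):
    trans, emit = Counter(), Counter()
    for seq in labeled_sequences:
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[(s, s_next)] += 1          # count state-to-state transitions
        for tok, s in seq:
            emit[(s, tok)] += 1              # count symbol emissions per state
    def normalize(counts):                   # counts -> conditional probabilities
        totals = Counter()
        for (s, _), c in counts.items():
            totals[s] += c
        return {k: c / totals[k[0]] for k, c in counts.items()}
    return normalize(trans), normalize(emit)

data = [[("Geng", "author"), (",", "author-delimiter"),
         ("Yang", "author"), ("2004", "year")]]
trans_probs, emit_probs = train(data)
print(trans_probs[("author", "author-delimiter")])  # 0.5
```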


    Parser Training, continued

    Key consideration is the HMM structure for navigating record fields (fields, delimiters)
    Special states:

    start, end

    Normal states: author, title, year, etc.

    Best structure found: have multiple delimiter and tag states, one for each normal state
    Example: author-delimiter, author-tag


    Sample HMM (Method 3)

    Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg



    Step 6 - Merging

    Merge new structured records into database

    Initial seed database has now grown

    New records will be used for improved matching on the next run


    Evaluation

    Success rate = (# of tokens labeled by HMM) / (# of tokens labeled by person)

    DBLP (Computer Science Bibliography): 98.9%

    CSWD (CompuScience WWW-Database): 93.4%


    HMM Advantages / Disadvantages

    Advantages
    Effective
    Can handle variations in record structure:

    Optional fields

    Varying field ordering

    Disadvantages
    Requires training using annotated data

    Not completely automatic
    May require manual markup
    Size of training data may be an issue


    Other methods

    Wrappers
    Specification of areas of interest on a Web page

    Hand-crafted

    Wrapper induction
    Requires manual training

    Does not always accommodate changing structure

    Syntax-based; no semantic labeling


    Application to Other Domains

    E-Commerce

    Comparison shopping sites

    Extract product/pricing information from many sites

    Convert information into structured format and store

    Provide interface to look up product information and then display pricing information gathered from many sites

    Saves users time

    Rather than navigating to and searching many sites, users can consult a single site


    References

    Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.

    Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS '04), 193-204.