Sequence Classification
with emphasis on Hidden Markov Models and Sequence Kernels

Andrew M. White
September 29, 2009
Outline

Sequential Data

Methods

Hidden Markov Models
- Evaluation: The Forward Algorithm
- Decoding: The Viterbi Algorithm
- Learning: The Baum-Welch Algorithm
- Profile HMMs

Kernels for Sequences
- Fixed-Length Subsequence Kernels
- All-Subsequences Kernel
- Variations
Examples

Biological Sequence Analysis
- genes
- proteins

Temporal Pattern Recognition
- speech
- gestures

Semantic Analysis
- handwriting
- part-of-speech detection
Sequential Data

Characteristics
- an example of structured data
- exhibits sequential correlation, i.e., nearby values are likely to be related

Why not just use earlier techniques?
- difficult to find appropriate features
- structural information is important
Example Framework

Speech Recognition
- goal: identify individual phonemes (the building blocks of speech, sounds like "ch" and "t")
- source data:
  - quantized speech waveforms
  - tagged phonemes as sequences of values
- multiple classes, each:
  - has hundreds to thousands of sequences
  - sequences vary in length
Methods for Sequence Classification

Generative Models
- Hidden Markov Models
- Stochastic Context-Free Grammars
- Conditional Random Fields

Discriminative Methods
- Kernel Methods (incl. SVMs)
- Max-margin Markov Networks
First Order Markov Models

(Figure: state diagram with nodes HOME, OFFICE, and CAFE)

Transition Probabilities

        home   office  cafe
home    0.2    0.6     0.2
office  0.5    0.2     0.3
cafe    0.2    0.8     0.0

Assuming the current state depends ONLY on the previous one, we can easily determine the probability of any path, e.g., home, cafe, office, home:

P(HCOH) = P(C|H) P(O|C) P(H|O) = (0.2)(0.8)(0.5) = 0.08
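The path-probability calculation above fits in a few lines of Python; a minimal sketch, with the transition table taken directly from the slide:

```python
# Transition probabilities P(next | current) from the slide's table.
A = {
    "home":   {"home": 0.2, "office": 0.6, "cafe": 0.2},
    "office": {"home": 0.5, "office": 0.2, "cafe": 0.3},
    "cafe":   {"home": 0.2, "office": 0.8, "cafe": 0.0},
}

def path_probability(path):
    """Probability of a state path under a first-order Markov chain,
    conditioning on the first state (as the slide does)."""
    p = 1.0
    for prev, cur in zip(path, path[1:]):
        p *= A[prev][cur]
    return p

p = path_probability(["home", "cafe", "office", "home"])
print(p)  # P(C|H) * P(O|C) * P(H|O) = 0.2 * 0.8 * 0.5
```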
First Order Hidden Markov Models

(Figure: the same state diagram, with emissions such as EMAIL and SMS attached to the hidden states HOME, OFFICE, and CAFE)

Transition Probabilities

        home   office  cafe
home    0.2    0.6     0.2
office  0.5    0.2     0.3
cafe    0.2    0.8     0.0

We can't directly observe the states, only the emissions. Does this change anything?

In this case, no.

P(FSEF) = P(HCOH) = P(C|H) P(O|C) P(H|O) = (0.2)(0.8)(0.5) = 0.08
First Order Hidden Markov Models (cont.)

(Figure: the state diagram with each of HOME, OFFICE, and CAFE emitting the same symbols, e.g. SMS, with different, state-specific probabilities)

What if the emissions aren't tied to individual states?
First Order Hidden Markov Models (cont.)

Transition Probabilities

        home   office  cafe
home    0.2    0.6     0.2
office  0.5    0.2     0.3
cafe    0.2    0.8     0.0

Emission Probabilities

        sms   facebook  email
home    0.3   0.5       0.2
office  0.1   0.1       0.8
cafe    0.8   0.1       0.1

Now we have to look at all possible state sequences which could have generated the given observation sequence.
The Three Canonical Problems of Hidden Markov Models

Evaluation
Given: parameters, observation sequence
Find: P(observation sequence | parameters)

Decoding
Given: parameters, observation sequence
Find: most likely state sequence

Learning
Given: observation sequence(s)
Find: parameters
HMMs: Notation

For now, consider a single observation sequence

O = o_1 o_2 … o_T

and an associated (unknown) state sequence

Q = q_1 q_2 … q_T.

Denote the transition probabilities

a_ij = P(transition from node i to node j)

and, similarly, the emission probabilities

b_jk = P(emission of symbol k from node j).
HMMs: Assumptions

We have two big assumptions to make:

Markov Assumption
The current state depends only on the previous state.

Independence Assumption
The current emission depends only on the current state.
HMMs: Evaluation

Under the Markov assumption, for a state sequence Q, we can calculate the probability of this sequence given the parameters λ = (A, B):

P(Q|λ) = ∏_{t=2}^{T} P(q_t | q_{t-1}) = a_{q_1 q_2} a_{q_2 q_3} … a_{q_{T-1} q_T}

Under our independence assumption, we can find the probability of an observation sequence O given a state sequence Q and the parameters λ = (A, B):

P(O|Q,λ) = ∏_{t=1}^{T} P(o_t | q_t) = b_{q_1 o_1} b_{q_2 o_2} … b_{q_{T-1} o_{T-1}} b_{q_T o_T}
HMMs: Evaluation (continued)

Now we can find the probability of an observation sequence given the model:

P(O|λ) = Σ_Q P(O|Q,λ) P(Q|λ) = Σ_Q b_{q_1 o_1} ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t o_t}

Note that this does a lot of redundant calculations.
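A direct translation of this sum over all state sequences, as a sketch for the home/office/cafe model. The slides leave the initial-state distribution implicit (the sum starts from b_{q_1 o_1} alone), and the code follows that convention:

```python
from itertools import product

A = {
    "H": {"H": 0.2, "O": 0.6, "C": 0.2},
    "O": {"H": 0.5, "O": 0.2, "C": 0.3},
    "C": {"H": 0.2, "O": 0.8, "C": 0.0},
}
B = {
    "H": {"sms": 0.3, "facebook": 0.5, "email": 0.2},
    "O": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "C": {"sms": 0.8, "facebook": 0.1, "email": 0.1},
}

def evaluate_brute_force(O):
    """P(O|lambda) by enumerating every state sequence Q:
    exponential in len(O) -- exactly the redundancy noted above."""
    states = list(A)
    total = 0.0
    for Q in product(states, repeat=len(O)):
        p = B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

print(evaluate_brute_force(["sms", "email"]))  # enumerates 3^2 = 9 sequences
```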
HMMs: Evaluation (continued)

(Figure: a trellis algorithm)

We can use dynamic programming to cache the redundant calculations by thinking in terms of partial observation sequences:

α_j(t) = P(o_1 o_2 … o_t, q_t = s_j | λ)

We'll refer to these as the forward probabilities for the observation sequence.
HMMs: the Forward algorithm

(Figure: the forward trellis)

For the first time step we have:

α_j(1) = P(o_1 | q_1 = s_j) = b_{j o_1}

Then we can calculate the forward probabilities from the trellis:

α_j(t) = P(o_1 o_2 … o_t, q_t = s_j | λ) = b_{j o_t} Σ_{i=1}^{N} a_{ij} α_i(t−1)
HMMs: the Forward algorithm (continued)

(Figure: the forward trellis)

Finally, the probability of the full observation sequence is the sum of the forward probabilities at the last time step:

P(O|λ) = Σ_{j=1}^{N} α_j(T)
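The whole recursion fits in a few lines. As a sanity check, this sketch compares the trellis computation against brute-force enumeration of all state sequences on the home/office/cafe model, again leaving the initial-state distribution implicit as the slides do:

```python
from itertools import product

A = {
    "H": {"H": 0.2, "O": 0.6, "C": 0.2},
    "O": {"H": 0.5, "O": 0.2, "C": 0.3},
    "C": {"H": 0.2, "O": 0.8, "C": 0.0},
}
B = {
    "H": {"sms": 0.3, "facebook": 0.5, "email": 0.2},
    "O": {"sms": 0.1, "facebook": 0.1, "email": 0.8},
    "C": {"sms": 0.8, "facebook": 0.1, "email": 0.1},
}
STATES = list(A)

def forward(O):
    """P(O|lambda) via the forward trellis: O(T * N^2) instead of O(N^T)."""
    # alpha_j(1) = b_{j o_1}
    alpha = {j: B[j][O[0]] for j in STATES}
    for o in O[1:]:
        # alpha_j(t) = b_{j o_t} * sum_i a_{ij} alpha_i(t-1)
        alpha = {j: B[j][o] * sum(A[i][j] * alpha[i] for i in STATES)
                 for j in STATES}
    return sum(alpha.values())

def brute_force(O):
    """Reference implementation: sum over every state sequence."""
    total = 0.0
    for Q in product(STATES, repeat=len(O)):
        p = B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

O = ["facebook", "sms", "email", "facebook"]
print(forward(O), brute_force(O))  # the two agree
```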
HMMs: the Viterbi Algorithm

Kevin Snow will present details of the Viterbi algorithm in the next class.
HMMs: the Baum-Welch algorithm

Given an observation sequence, how do we estimate the parameters of the HMM?

- transition probabilities:

  a_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

- emission probabilities:

  b_jk = (expected number of emissions of symbol k from state j) / (expected number of emissions from state j)

This is another Expectation-Maximization algorithm.
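When the state sequence happens to be observed, the expected counts in these ratios reduce to plain counts, which makes the update formulas easy to see. A minimal counting sketch under that simplifying assumption (full Baum-Welch instead computes the expected counts from the forward-backward probabilities):

```python
from collections import Counter

def estimate_parameters(states, observations):
    """Ratio-of-counts estimates for a_ij and b_jk, assuming the state
    sequence is observed; in Baum-Welch these are *expected* counts."""
    trans = Counter(zip(states, states[1:]))
    emit = Counter(zip(states, observations))
    from_i = Counter(states[:-1])     # transitions out of state i
    emits_from = Counter(states)      # emissions from state j
    a = {(i, j): c / from_i[i] for (i, j), c in trans.items()}
    b = {(j, k): c / emits_from[j] for (j, k), c in emit.items()}
    return a, b

a, b = estimate_parameters(
    ["H", "C", "O", "H"],
    ["facebook", "sms", "email", "facebook"],
)
print(a[("H", "C")])         # only one transition leaves H
print(b[("H", "facebook")])  # both visits to H emitted facebook
```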
HMMs: Baum-Welch as Expectation-Maximization

Remember, the sequence of states for the observation sequence is our hidden variable.

Expectation
Given
- an observation sequence O
- an estimate of the parameters λ
we can find the expectation of the log-likelihood for the observation sequence over the possible state sequences.

Maximization
Then we maximize this expectation over all possible λ.

Baum et al. proved that this procedure converges to a local maximum.
Profile HMMs

Left-Right HMMs
- a special type of HMM
- links only go in one direction
- no circular routes involving more than one node
- specialized start and end nodes

Profile HMMs
- a special type of Left-Right HMM
- have special delete states which don't emit symbols
- consist of sets of match, insert, and delete states
Profile HMM (continued)

Figure: Topology of a Profile HMM (match states M_i, insert states I_i, and delete states D_i, with start node S and end node E)
HMMs in our Example Framework

Recall the framework for classifying phonemes.

Generative HMM classifier for speech recognition
- for each phoneme, train a (profile) HMM using Baum-Welch
- for each test example:
  - score using the Forward algorithm for each HMM
  - classify according to whichever HMM scores highest
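The recipe is "train one model per class, score with Forward, take the argmax". A toy sketch, with two hand-built single-state models standing in for trained phoneme HMMs (the class names, states, and symbols here are made up purely for illustration):

```python
def forward_score(model, obs):
    """P(obs | model) via the forward recursion
    (initial-state distribution left implicit, as in the slides)."""
    A, B = model["A"], model["B"]
    alpha = {j: B[j][obs[0]] for j in A}
    for o in obs[1:]:
        alpha = {j: B[j][o] * sum(A[i][j] * alpha[i] for i in A)
                 for j in A}
    return sum(alpha.values())

# Hypothetical per-class models; in practice each would be a profile HMM
# trained with Baum-Welch on that phoneme's training sequences.
models = {
    "ch": {"A": {"s": {"s": 1.0}}, "B": {"s": {"x": 0.9, "y": 0.1}}},
    "t":  {"A": {"s": {"s": 1.0}}, "B": {"s": {"x": 0.1, "y": 0.9}}},
}

def classify(obs):
    """Label = whichever class's HMM scores the observation highest."""
    return max(models, key=lambda c: forward_score(models[c], obs))

print(classify(["x", "x"]))  # the "ch" model scores 0.81 vs 0.01
```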
Generative vs. Discriminative

HMMs as Generative Models
- can treat an HMM as a generator for a distribution
- build an individual HMM for each class of interest
- can give the probability of an example given the model

Kernel Methods as Discriminative Models
- model pairs of classes
- find discriminant functions
Kernels for Sequences

Fixed-Length Subsequence Kernels
Based on counting common subsequences of a fixed length.
- p-spectrum kernels
- fixed-length subsequences kernel
- gap-weighted subsequences kernel

All-Subsequences Kernel
Based on counting all common contiguous or non-contiguous subsequences.
Fixed-Length Subsequence Kernels (continued)

p-Spectrum Kernels
- count the number of common (contiguous) subsequences of length p
- useful where contiguity is important structurally

For example, with p = 2:

φ      ar  at  ba  ca
bar     1   0   1   0
bat     0   1   1   0
car     1   0   0   1
cat     0   1   0   1

K      bar  bat  car  cat
bar     2    1    1    0
bat     1    2    0    1
car     1    0    2    1
cat     0    1    1    2
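The tables above are easy to reproduce; a small sketch of the p-spectrum kernel as a dot product of contiguous-substring counts:

```python
from collections import Counter

def spectrum_features(s, p):
    """Counts of the contiguous length-p substrings of s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def p_spectrum_kernel(s, t, p=2):
    """K(s, t) = <phi(s), phi(t)> over shared length-p substrings."""
    phi_s, phi_t = spectrum_features(s, p), spectrum_features(t, p)
    return sum(c * phi_t[u] for u, c in phi_s.items())

print(p_spectrum_kernel("bar", "bat"))  # share only "ba"
print(p_spectrum_kernel("bar", "bar"))  # "ba" and "ar"
```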
Sequential DataMethods
Hidden Markov ModelsKernels for Sequences
Fixed-Length Subsequence KernelsAll-Subsequences KernelVariations
Fixed-Length Subsequence Kernels (continued)
Fixed-Length Subsequences Kernel

▸ counts common (non-contiguous) subsequences of a given length
▸ allows insertions/deletions with no penalty

φ     aa  ar  at  ba  br  bt
baa    1   0   0   2   0   0
bar    0   1   0   1   1   0
bat    0   0   1   1   0   1

K     baa  bar  bat
baa     5    2    2
bar     2    3    1
bat     2    1    3
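A brute-force sketch (fine for short strings; practical implementations use dynamic programming instead). Note that subsequences are counted with multiplicity, so "baa" contains "ba" twice and ⟨φ(baa), φ(baa)⟩ = 2² + 1 = 5:

```python
from collections import Counter
from itertools import combinations

def subseq_features(s, p):
    # Count every (possibly non-contiguous) length-p subsequence of s,
    # with multiplicity: "baa" contains "ba" twice.
    return Counter("".join(s[i] for i in idx)
                   for idx in combinations(range(len(s)), p))

def fixed_length_subseq_kernel(s, t, p):
    # Inner product of the two subsequence count vectors.
    fs, ft = subseq_features(s, p), subseq_features(t, p)
    return sum(fs[u] * ft[u] for u in fs)

# e.g. fixed_length_subseq_kernel("baa", "bar", 2) multiplies the two
# counts of "ba" (2 and 1), giving 2.
```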
Fixed-Length Subsequence Kernels (continued)
Gap-Weighted Subsequences Kernel

▸ interpolates between the fixed-length and p-spectrum kernels
▸ allows weighting the importance of contiguity

φ     aa    ar    at    ba         br    bt
baa   λ²    0     0     λ² + λ³    0     0
bar   0     λ²    0     λ²         λ³    0
bat   0     0     λ²    λ²         0     λ³
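A brute-force sketch of the gap-weighted feature map, assuming the usual convention that an occurrence over indices i₁ < … < i_p is weighted by λ raised to the span i_p − i₁ + 1:

```python
from collections import defaultdict
from itertools import combinations

def gap_weighted_features(s, p, lam):
    # Each length-p subsequence occurrence is down-weighted by
    # lam ** span, where span is the stretch of text it occupies,
    # so gapped occurrences cost more than contiguous ones.
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), p):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def gap_weighted_kernel(s, t, p, lam):
    fs, ft = gap_weighted_features(s, p, lam), gap_weighted_features(t, p, lam)
    return sum(fs[u] * ft[u] for u in fs)

# For "baa": phi["ba"] = lam**2 + lam**3, one contiguous occurrence and
# one with a single gap, matching the λ² + λ³ entry in the table.
```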
All-Subsequences Kernel
All-Subsequences Kernel

▸ counts the number of common subsequences of any length
▸ both contiguous and non-contiguous subsequences are considered
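A brute-force sketch, following the convention in Shawe-Taylor and Cristianini that the empty subsequence is included; practical implementations use a dynamic-programming recursion rather than enumeration:

```python
from collections import Counter
from itertools import combinations

def all_subseq_features(s):
    # Count every subsequence of s, of every length (including the
    # empty subsequence), with multiplicity.
    phi = Counter()
    for p in range(len(s) + 1):
        for idx in combinations(range(len(s)), p):
            phi["".join(s[i] for i in idx)] += 1
    return phi

def all_subsequences_kernel(s, t):
    # Inner product of the two (exponentially long) count vectors.
    fs, ft = all_subseq_features(s), all_subseq_features(t)
    return sum(fs[u] * ft[u] for u in fs)

# "bar" and "bat" share the empty string, "b", "a", and "ba",
# so the kernel value is 4.
```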
Variations of Sequence Kernels

Character Weights

different inserted characters incur different penalties

Soft Matching

two different characters can match, with a penalty

Gap-Number Weighting

weight the number of gaps instead of the gap lengths

For details, see Shawe-Taylor and Cristianini, 2004.
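As one illustration, soft matching can be grafted onto the p-spectrum kernel by replacing the exact-equality test with a character-similarity function; the `sim` functions below are invented for this example, not taken from the cited book:

```python
def soft_p_spectrum_kernel(s, t, p, sim):
    # Compare every p-gram of s against every p-gram of t, scoring a
    # pair by the product of per-character similarities instead of
    # requiring an exact match.
    total = 0.0
    for i in range(len(s) - p + 1):
        for j in range(len(t) - p + 1):
            score = 1.0
            for k in range(p):
                score *= sim(s[i + k], t[j + k])
            total += score
    return total

# With hard equality this reduces to the ordinary p-spectrum kernel;
# letting "r" and "t" match at 0.5 raises K("bar", "bat") from 1 to 1.5.
hard = lambda a, b: 1.0 if a == b else 0.0
soft = lambda a, b: 1.0 if a == b else (0.5 if {a, b} == {"r", "t"} else 0.0)
```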
Sequence Kernels in our Example Framework
Recall the framework for classifying phonemes.
Discriminative SVM classifier for speech recognition
▸ for each pair of phonemes, train a binary classifier using a sequence kernel
▸ for each test example:
  ▸ each binary classifier votes for a label
  ▸ classify according to whichever label receives the most votes
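The one-vs-one voting scheme can be sketched as follows; the phoneme labels, the prototype sequences, and the similarity rule standing in for trained SVMs are all invented for illustration:

```python
from collections import Counter
from itertools import combinations

def p_spectrum_kernel(s, t, p=2):
    fs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ft = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(fs[u] * ft[u] for u in fs)

# Toy prototypes standing in for training data (hypothetical).
prototypes = {"aa": "aaxaax", "iy": "iyxiyx", "eh": "ehxehx"}

def make_pair_classifier(l1, l2):
    # Stand-in for a trained binary SVM: vote for whichever of the two
    # labels has the more kernel-similar prototype.
    def clf(x):
        k1 = p_spectrum_kernel(x, prototypes[l1])
        k2 = p_spectrum_kernel(x, prototypes[l2])
        return l1 if k1 >= k2 else l2
    return clf

classifiers = [make_pair_classifier(a, b)
               for a, b in combinations(prototypes, 2)]

def one_vs_one_predict(x):
    # Each binary classifier votes; the most-voted label wins.
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```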
Sequential Data
▸ can be treated differently from feature-vector data
▸ provides structurally important information

Hidden Markov Models

▸ generative models of sequences
▸ assume:
  ▸ state t depends only on state t − 1
  ▸ emission t depends only on state t

Kernels for Sequences

▸ discriminative approach
▸ many different kernels
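The two HMM assumptions fix the factorization of the joint distribution over a state path and its emissions; a minimal sketch, with made-up transition and emission numbers:

```python
def hmm_joint_prob(states, emissions, init, trans, emit):
    # P(z, x) = P(z_1) P(x_1|z_1) * prod_t P(z_t|z_{t-1}) P(x_t|z_t):
    # each state depends only on its predecessor, and each emission
    # depends only on its state.
    p = init[states[0]] * emit[states[0]][emissions[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][emissions[t]]
    return p

# Toy two-state model (hypothetical numbers):
init  = {"H": 0.6, "L": 0.4}
trans = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
emit  = {"H": {"A": 0.5, "B": 0.5}, "L": {"A": 0.2, "B": 0.8}}
```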
References
Phil Blunsom. Hidden Markov Models, 2004.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, pages 257–286, 1989.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, United Kingdom, 2004.