
GLOBEX Bioinformatics

(Summer 2015)

Hidden Markov Models (I)

a. The model

b. The decoding: Viterbi algorithm

Hidden Markov models

• A Markov chain of states.

• At each state, there is a set of possible observables (symbols).

• The states themselves are not directly observable, i.e., they are hidden.

• E.g., Casino fraud

• Three major problems

– Most probable state path

– The likelihood

– Parameter estimation for HMMs

[State diagram: the dishonest-casino HMM.
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2.
Transitions: Fair → Fair 0.95, Fair → Loaded 0.05, Loaded → Loaded 0.9, Loaded → Fair 0.1.]
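To make the example concrete, here is a minimal sketch of the casino HMM written as plain Python dictionaries; the state labels "F" and "L" are my own shorthand, while the probabilities are the ones in the diagram above.

```python
# The occasionally dishonest casino as an HMM, written as plain dictionaries.
# State names ("F", "L") are illustrative labels; probabilities come from the
# diagram above.

states = ["F", "L"]                     # F = fair die, L = loaded die

transitions = {                          # a[k][l] = P(next state l | current state k)
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}

emissions = {                            # e[k][b] = P(observe roll b | state k)
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}

# Sanity check: each row of probabilities sums to 1.
for k in states:
    assert abs(sum(transitions[k].values()) - 1.0) < 1e-9
    assert abs(sum(emissions[k].values()) - 1.0) < 1e-9
```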

A biological example: CpG islands

• The higher rate of methyl-C mutating to T in CpG dinucleotides leads to a generally lower CpG presence in the genome, except in some biologically important regions, e.g., promoters; such regions are called CpG islands.

• The conditional probabilities P±(N|N′) are collected from ~60,000 bp of human genome sequence; "+" stands for CpG islands and "−" for non-CpG islands.

P− (outside CpG islands); rows: preceding nucleotide N′, columns: current nucleotide N

       A      C      G      T
A    .300   .205   .285   .210
C    .322   .298   .078   .302
G    .248   .246   .298   .208
T    .177   .239   .292   .292

P+ (inside CpG islands)

       A      C      G      T
A    .180   .274   .426   .120
C    .171   .368   .274   .188
G    .161   .339   .375   .125
T    .079   .355   .384   .182

Task 1: given a sequence x, determine whether it is a CpG island.

One solution: compute the log-odds ratio scored by the two Markov chains:

S(x) = log [ P(x | model +) / P(x | model −) ]

where P(x | model +) = P+(x2|x1) P+(x3|x2) … P+(xL|xL−1) and

P(x | model −) = P−(x2|x1) P−(x3|x2) … P−(xL|xL−1)
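As a sketch of this scoring scheme, S(x) can be computed directly from the two tables above; the function name, the example input, and the length normalization at the end are mine, while the probabilities are the P+/P− values just given.

```python
import math

# Transition probabilities P(current | previous); rows = previous nucleotide.
P_PLUS = {  # inside CpG islands
    "A": {"A": .180, "C": .274, "G": .426, "T": .120},
    "C": {"A": .171, "C": .368, "G": .274, "T": .188},
    "G": {"A": .161, "C": .339, "G": .375, "T": .125},
    "T": {"A": .079, "C": .355, "G": .384, "T": .182},
}
P_MINUS = {  # outside CpG islands
    "A": {"A": .300, "C": .205, "G": .285, "T": .210},
    "C": {"A": .322, "C": .298, "G": .078, "T": .302},
    "G": {"A": .248, "C": .246, "G": .298, "T": .208},
    "T": {"A": .177, "C": .239, "G": .292, "T": .292},
}

def log_odds(x: str) -> float:
    """S(x) = log [ P(x | model +) / P(x | model -) ], summed over dinucleotides."""
    s = 0.0
    for prev, cur in zip(x, x[1:]):
        s += math.log(P_PLUS[prev][cur] / P_MINUS[prev][cur])
    return s

if __name__ == "__main__":
    x = "CGCGGCGCGC"          # toy input; a positive score suggests a CpG island
    print(x, log_odds(x), log_odds(x) / len(x))   # raw and length-normalized score
```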

[Figure: histogram of the length-normalized scores S(x) (number of sequences vs. score); CpG-island sequences are shown dark shaded.]

Task 2: For a long genomic sequence x, label the CpG islands, if there are any.

Approach 1: Adapt the method from Task 1 by calculating the log-odds score for a window of, say, 100 bps around every nucleotide and plotting it.

Problems with this approach:

– It won't do well if CpG islands have sharp boundaries and variable lengths.

– There is no effective way to choose a good window size.

Approach 2: Use a hidden Markov model.

[State diagram: two states, "+" (CpG island) and "-" (non-CpG island).
Emission probabilities: e+(A) = .170, e+(C) = .368, e+(G) = .274, e+(T) = .188; e-(A) = .372, e-(C) = .198, e-(G) = .112, e-(T) = .338.
Transition probabilities: a++ = 0.65, a+- = 0.35, a-+ = 0.30, a-- = 0.70.]

• The model has two states, "+" for CpG island and "-" for non-CpG island. The numbers here are made up and would be fixed by learning from training examples.

• Notation: akl is the transition probability from state k to state l; ek(b) is the emission probability, i.e., the probability that symbol b is observed when in state k.

The probability that sequence x is emitted along a state path π is:

P(x, π) = ∏i=1..L eπi(xi) aπi πi+1

Example (positions i = 1..9):

i: 1 2 3 4 5 6 7 8 9
x: T G C G C G T A C
π: - - + + + + - - -

P(x, π) = 0.338 × 0.70 × 0.112 × 0.30 × 0.368 × 0.65 × 0.274 × 0.65 × 0.368 × 0.65 × 0.274 × 0.35 × 0.338 × 0.70 × 0.372 × 0.70 × 0.198

(the final transition factor is omitted here, since this toy model has no explicit end state).

Then the probability of observing sequence x in the model is

P(x) = Σπ P(x, π),

which is also called the likelihood of the model.
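As a quick check of the worked example, here is a minimal sketch that multiplies the emission and transition probabilities along a given path; the function and variable names are mine, while the parameters are the made-up values of the two-state model above.

```python
# Two-state CpG model from the lecture (made-up numbers).
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35},
         "-": {"+": 0.30, "-": 0.70}}

def joint_probability(x: str, path: str) -> float:
    """P(x, pi): product of e_{pi_i}(x_i) and a_{pi_i, pi_{i+1}} along the path
    (no begin/end transitions, matching the worked example)."""
    p = 1.0
    for i, (sym, state) in enumerate(zip(x, path)):
        p *= EMIT[state][sym]
        if i + 1 < len(path):                 # transition to the next state
            p *= TRANS[state][path[i + 1]]
    return p

if __name__ == "__main__":
    x, path = "TGCGCGTAC", "--++++---"
    print(joint_probability(x, path))   # same 17 factors as the product above
```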

[State diagram of the two-state CpG model, repeated for reference: emissions e± and transitions a++ = 0.65, a+- = 0.35, a-+ = 0.30, a-- = 0.70 as above.]

Decoding: Given an observed sequence x, what is the most probable state path, i.e.,

π* = argmaxπ P(x, π)

Q: Given a sequence x of length L, how many state paths are there?

A: N^L, where N is the number of states in the model.

Since this is an exponential function of the input size, enumerating all state paths to compute P(x) or to find π* is infeasible.

Let vk(i) be the probability of the most probable path ending at position i in state k.

Viterbi Algorithm

Initialization: v0(0) = 1, vk(0) = 0 for k > 0.

Recursion (i = 1..L): vk(i) = ek(xi) maxj (vj(i-1) ajk);
                      ptri(k) = argmaxj (vj(i-1) ajk).

Termination: P(x, π*) = maxk (vk(L) ak0);
             π*L = argmaxk (vk(L) ak0).

Traceback (i = L..2): π*i-1 = ptri(π*i).

[State diagram of the two-state CpG model as above, extended with a silent begin state 0 (ε) that has transitions a0,+ = a0,- = 0.5.]

Worked example: the Viterbi table vk(i) for x = AAGCG, filled with vk(i) = ek(xi) maxj (vj(i-1) ajk):

 k \ i     ε        A        A        G        C        G
  0        1        0        0        0        0        0
  +        0      .085     .0095    .0039    .00093   .00038
  -        0      .186     .048     .0038    .00053   .000042

Traceback yields the hidden state path - - + + +.

Casino Fraud: investigation results by Viterbi decoding

• The log transformation for the Viterbi algorithm

Starting from vk(i) = ek(xi) maxj (vj(i-1) ajk), define

ãjk = log ajk,   ẽk(xi) = log ek(xi),   ṽk(i) = log vk(i).

The recursion then becomes

ṽk(i) = ẽk(xi) + maxj (ṽj(i-1) + ãjk),

which replaces products of many small probabilities with sums of logs and so avoids numerical underflow; see the sketch below.
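A minimal log-space Viterbi sketch for the two-state model above; the begin probabilities a0,± = 0.5 are taken from the worked table, while the variable names and the absence of an end-state transition are my own assumptions.

```python
import math

STATES = ["+", "-"]
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35},
         "-": {"+": 0.30, "-": 0.70}}
BEGIN = {"+": 0.5, "-": 0.5}   # a_{0,k}, as in the worked Viterbi table

def viterbi(x: str) -> str:
    """Most probable state path, computed in log space to avoid underflow.
    No end-state transition a_{k,0} is applied (assumption: the toy model
    does not specify one)."""
    # v[i][k] = log of the best path probability ending at position i in state k
    v = [{k: math.log(BEGIN[k]) + math.log(EMIT[k][x[0]]) for k in STATES}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for k in STATES:
            best = max(STATES, key=lambda j: v[i - 1][j] + math.log(TRANS[j][k]))
            ptr[i][k] = best
            v[i][k] = math.log(EMIT[k][x[i]]) + v[i - 1][best] + math.log(TRANS[best][k])
    # Traceback from the best final state.
    last = max(STATES, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path))

if __name__ == "__main__":
    print(viterbi("AAGCG"))   # should recover the path --+++ from the worked example
```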

GLOBEX Bioinformatics

(Summer 2015)

Hidden Markov Models (II)

• The model likelihood: forward algorithm, backward algorithm

• Posterior decoding

Recall from Part I: the probability that sequence x is emitted along a state path π is

P(x, π) = ∏i=1..L eπi(xi) aπi πi+1,

and the probability of observing x in the model, also called the likelihood of the model, is

P(x) = Σπ P(x, π).

[State diagram of the two-state CpG model, repeated for reference: emissions e± and transitions a++ = 0.65, a+- = 0.35, a-+ = 0.30, a-- = 0.70 as above.]

How do we calculate the probability of observing sequence x in the model,

P(x) = Σπ P(x, π)?

Let fk(i) be the probability contributed by all paths from the beginning up to (and including) position i, with the state at position i being k. Then the following recurrence holds:

fk(i) = [ Σj fj(i-1) ajk ] ek(xi)

Graphically:

[Trellis diagram: the silent begin state 0 fans out to states 1, …, N at position x1; every state 1, …, N at position xi-1 connects to every state at position xi, and the states at xL connect back to the silent state 0. Again, a silent state 0 is introduced for better presentation.]

Forward algorithm

Initialization: f0(0) = 1, fk(0) = 0 for k > 0.

Recursion (i = 1..L): fk(i) = ek(xi) Σj fj(i-1) ajk.

Termination: P(x) = Σk fk(L) ak0.

Time complexity: O(N²L), where N is the number of states and L is the sequence length.
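A minimal sketch of the forward algorithm on the same toy model; treating the end transitions ak0 as 1 (i.e., no explicit end state) is an assumption, since the lecture's toy model does not specify them.

```python
STATES = ["+", "-"]
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35},
         "-": {"+": 0.30, "-": 0.70}}
BEGIN = {"+": 0.5, "-": 0.5}   # a_{0,k}
END = {"+": 1.0, "-": 1.0}     # a_{k,0}; set to 1 here (assumption: no end state)

def forward(x: str):
    """Return the forward matrix f[i][k] and the likelihood P(x)."""
    f = [{k: BEGIN[k] * EMIT[k][x[0]] for k in STATES}]          # f_k(1)
    for i in range(1, len(x)):
        f.append({k: EMIT[k][x[i]] * sum(f[i - 1][j] * TRANS[j][k] for j in STATES)
                  for k in STATES})
    px = sum(f[-1][k] * END[k] for k in STATES)                  # termination
    return f, px

if __name__ == "__main__":
    _, px = forward("AAGCG")
    print(px)   # P(x): the sum over all state paths, computed in O(N^2 L) time
```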

Let bk(i) be the probability of emitting the rest of the sequence, xi+1, …, xL, given that the state at position i is k:

bk(i) = P(xi+1, …, xL | πi = k)

Backward algorithm

Initialization: bk(L) = ak0 for all k.

Recursion (i = L-1, …, 1): bk(i) = Σj akj ej(xi+1) bj(i+1).

Termination: P(x) = Σk a0k ek(x1) bk(1).

Time complexity: O(N²L), where N is the number of states and L is the sequence length.
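A matching sketch of the backward algorithm, under the same assumptions as the forward sketch; its termination value should agree with the forward algorithm's P(x).

```python
STATES = ["+", "-"]
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35},
         "-": {"+": 0.30, "-": 0.70}}
BEGIN = {"+": 0.5, "-": 0.5}   # a_{0,k}
END = {"+": 1.0, "-": 1.0}     # a_{k,0}; assumption: no explicit end state

def backward(x: str):
    """Return the backward matrix b[i][k] and the likelihood P(x)."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: END[k] for k in STATES}        # initialization: b_k(L) = a_{k,0}
    for i in range(L - 2, -1, -1):                # recursion, position L-1 down to 1
        b[i] = {k: sum(TRANS[k][j] * EMIT[j][x[i + 1]] * b[i + 1][j] for j in STATES)
                for k in STATES}
    px = sum(BEGIN[k] * EMIT[k][x[0]] * b[0][k] for k in STATES)   # termination
    return b, px

if __name__ == "__main__":
    _, px = backward("AAGCG")
    print(px)   # should equal the forward algorithm's P(x)
```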

Posterior decoding

P(πi = k | x) = P(x, πi = k) / P(x) = fk(i) bk(i) / P(x)

Algorithm:

for i = 1 to L:
    π̂i = argmaxk P(πi = k | x)

Notes:

1. Posterior decoding may be useful when there are multiple, almost equally probable state paths, or when a function is defined on the states.

2. The state path identified by posterior decoding may not be the most probable path overall; it may not even be a viable path (it can use transitions of probability zero).
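A compact posterior-decoding sketch that combines the forward and backward passes on the same toy model (same assumptions as above; the function name is mine).

```python
STATES = ["+", "-"]
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35}, "-": {"+": 0.30, "-": 0.70}}
BEGIN = {"+": 0.5, "-": 0.5}        # a_{0,k}; end transitions taken as 1 (assumption)

def posterior_decode(x: str) -> str:
    L = len(x)
    # Forward pass: f[i][k] = P(x_1..x_i, pi_i = k)
    f = [{k: BEGIN[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, L):
        f.append({k: EMIT[k][x[i]] * sum(f[i - 1][j] * TRANS[j][k] for j in STATES)
                  for k in STATES})
    px = sum(f[L - 1][k] for k in STATES)        # P(x)
    # Backward pass: b[i][k] = P(x_{i+1}..x_L | pi_i = k)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in STATES}
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][j] * EMIT[j][x[i + 1]] * b[i + 1][j] for j in STATES)
                for k in STATES}
    # Posterior P(pi_i = k | x) = f_k(i) b_k(i) / P(x); take the argmax at each position.
    return "".join(max(STATES, key=lambda k: f[i][k] * b[i][k] / px) for i in range(L))

if __name__ == "__main__":
    print(posterior_decode("AAGCG"))   # per-position most probable states (may differ from Viterbi)
```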

GLOBEX Bioinformatics

(Summer 2015)

Hidden Markov Models (III)

- Viterbi training

- Baum-Welch algorithm

- Maximum Likelihood

- Expectation Maximization

Model building

- Topology: requires domain knowledge.

- Parameters:

  - When states are labeled for the sequences of observables, use simple counting:

    akl = Akl / Σl′ Akl′   and   ek(b) = Ek(b) / Σb′ Ek(b′),

    where Akl is the number of observed k→l transitions and Ek(b) is the number of times symbol b is emitted in state k (see the counting sketch after this list).

  - When states are not labeled:

    Method 1 (Viterbi training):
      1. Assign random parameters.
      2. Use the Viterbi algorithm for labeling/decoding.
      3. Do counting to collect new akl and ek(b).
      4. Repeat steps 2 and 3 until a stopping criterion is met.

    Method 2 (Baum-Welch algorithm), described next.
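Before turning to Baum-Welch, here is a minimal sketch of the simple-counting estimator referenced above; it applies both to labeled training data and to the counting step of Viterbi training. The function and variable names are mine.

```python
from collections import defaultdict

def estimate_from_labeled(data):
    """Estimate a_kl and e_k(b) by counting over (sequence, state path) pairs:
    a_kl = A_kl / sum_l' A_kl'   and   e_k(b) = E_k(b) / sum_b' E_k(b')."""
    A = defaultdict(lambda: defaultdict(float))   # A[k][l]: observed k->l transitions
    E = defaultdict(lambda: defaultdict(float))   # E[k][b]: symbol b emitted in state k
    for x, path in data:
        for i, (sym, k) in enumerate(zip(x, path)):
            E[k][sym] += 1
            if i + 1 < len(path):
                A[k][path[i + 1]] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in A[k]} for k in A}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return a, e

if __name__ == "__main__":
    # Toy labeled example reusing the path from the lecture's CpG illustration.
    a, e = estimate_from_labeled([("TGCGCGTAC", "--++++---")])
    print(a)
    print(e)
```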

Baum-Welch algorithm (Expectation-Maximization)

• An iterative procedure similar to Viterbi training.

• The probability that the transition k→l is used at position i of training sequence x^j:

  P(πi = k, πi+1 = l | x^j, θ) = fk^j(i) akl el(x^j_i+1) bl^j(i+1) / P(x^j)

• Calculate the expected number of times the transition k→l is used by summing over all positions and over all training sequences:

  Akl = Σj (1 / P(x^j)) [ Σi fk^j(i) akl el(x^j_i+1) bl^j(i+1) ]

• Similarly, calculate the expected number of times that symbol b is emitted in state k:

  Ek(b) = Σj (1 / P(x^j)) [ Σ{i : x^j_i = b} fk^j(i) bk^j(i) ]
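A sketch of this E-step on the toy two-state model: compute the forward and backward matrices as above, then accumulate the expected counts Akl and Ek(b) over the training sequences. The helper names and the no-end-state treatment are assumptions.

```python
STATES = ["+", "-"]
EMIT = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
        "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
TRANS = {"+": {"+": 0.65, "-": 0.35}, "-": {"+": 0.30, "-": 0.70}}
BEGIN = {"+": 0.5, "-": 0.5}

def forward(x):
    f = [{k: BEGIN[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({k: EMIT[k][x[i]] * sum(f[i - 1][j] * TRANS[j][k] for j in STATES)
                  for k in STATES})
    return f, sum(f[-1][k] for k in STATES)

def backward(x):
    b = [dict() for _ in range(len(x))]
    b[-1] = {k: 1.0 for k in STATES}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][j] * EMIT[j][x[i + 1]] * b[i + 1][j] for j in STATES)
                for k in STATES}
    return b

def expected_counts(seqs):
    """E-step: expected transition counts A[k][l] and emission counts E[k][b]."""
    A = {k: {l: 0.0 for l in STATES} for k in STATES}
    E = {k: {} for k in STATES}
    for x in seqs:
        f, px = forward(x)
        b = backward(x)
        for i in range(len(x)):
            for k in STATES:
                # Expected emission count: P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
                E[k][x[i]] = E[k].get(x[i], 0.0) + f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in STATES:
                        # P(pi_i = k, pi_{i+1} = l | x)
                        A[k][l] += f[i][k] * TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] / px
    return A, E

if __name__ == "__main__":
    A, E = expected_counts(["TGCGCGTAC", "AAGCG"])
    # The M-step (re-estimation) would then normalize these counts row by row.
    print(A)
    print(E)
```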

Maximum Likelihood

Define L(θ) = P(x | θ).

Estimate θ such that the distribution with the estimated θ best agrees with, i.e., supports, the data observed so far:

θ_ML = argmaxθ L(θ)

E.g., there are red and black balls in a box. What is the probability p of picking a black ball? Do sampling (with replacement).

Maximum Likelihood (continued)

Define L(θ) = P(x | θ). Estimate θ such that the distribution with the estimated θ best supports the data observed so far:

θ_ML = argmaxθ L(θ)

When L(θ) is differentiable, set dL/dθ = 0 at θ_ML and solve.

For example, we want to know the probability p of picking a black ball. Sampling with replacement 100 times gives the counts: black ball 9, white ball 91.

P(data | p) = p^9 (1−p)^91 (i.i.d. draws), so the likelihood is L(p) = p^9 (1−p)^91.

dL/dp = 9 p^8 (1−p)^91 − 91 p^9 (1−p)^90 = 0  ⟹  p_ML = 9/100 = 9%.

The ML estimate of p is just the observed frequency.

A proof that the observed frequencies give the ML estimate of the probabilities for a multinomial distribution

Let ni be the count for outcome i, and N = Σi ni. The observed frequencies are θi^ML = ni / N.

Claim: if θi^ML = ni / N, then P(n | θ^ML) ≥ P(n | θ) for any θ, with equality only when θ = θ^ML.

Proof:

log [ P(n | θ^ML) / P(n | θ) ] = log [ ∏i (θi^ML)^ni / ∏i θi^ni ]
                               = Σi ni log (θi^ML / θi)
                               = N Σi θi^ML log (θi^ML / θi)
                               = N · H(θ^ML || θ) ≥ 0,

where H(θ^ML || θ) is the relative entropy (Kullback-Leibler divergence), which is always non-negative. ∎

Maximum Likelihood: pros and cons

- Consistent: in the limit of a large amount of data, the ML estimate converges to the true parameters by which the data were generated.

- Simple.

- Poor estimate when data are insufficient. E.g., if you roll a die fewer than 6 times, the ML estimate for some outcomes will be zero.

Pseudocounts:

θi = (ni + αi) / (N + A),   where A = Σi αi.
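A tiny sketch of the pseudocount estimate; the die-roll counts and the αi values are made up for illustration.

```python
def pseudocount_estimate(counts, alpha):
    """theta_i = (n_i + alpha_i) / (N + A), where N = sum n_i and A = sum alpha_i."""
    N = sum(counts.values())
    A = sum(alpha.values())
    return {i: (counts.get(i, 0) + alpha[i]) / (N + A) for i in alpha}

if __name__ == "__main__":
    rolls = {"1": 2, "3": 1, "6": 2}                 # 5 rolls; faces 2, 4, 5 never observed
    alpha = {str(i): 1 for i in range(1, 7)}         # one pseudocount per face
    print(pseudocount_estimate(rolls, alpha))        # no face gets probability zero
```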

Conditional Probability and Joint Probability

[Figure: a collection of 13 objects used for the probabilities below.]

P(one) = 5/13
P(square) = 8/13
P(one, square) = 3/13
P(one | square) = 3/8 = P(one, square) / P(square)

In general, P(D, M) = P(D | M) P(M) = P(M | D) P(D)

⟹ Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)

Conditional Probability and Conditional Independence

Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)

Example: disease diagnosis/inference.

P(Fever | Leukemia) = 0.85, P(Fever) = 0.9, P(Leukemia) = 0.005.

P(Leukemia | Fever) = P(F | L) P(L) / P(F) = 0.85 × 0.005 / 0.9 ≈ 0.0047

Bayesian Inference

Maximum a posteriori (MAP) estimate:

θ_MAP = argmaxθ P(θ | x)

Expectation Maximization

Since P(x, y | θ) = P(y | x, θ) P(x | θ), we have P(x | θ) = P(x, y | θ) / P(y | x, θ), and hence

log P(x | θ) = log P(x, y | θ) − log P(y | x, θ).

Taking the expectation over y with respect to P(y | x, θ^t), where θ^t is the current parameter estimate:

log P(x | θ) = Σy P(y | x, θ^t) log P(x, y | θ) − Σy P(y | x, θ^t) log P(y | x, θ).

Define Q(θ | θ^t) = Σy P(y | x, θ^t) log P(x, y | θ). Then

log P(x | θ) − log P(x | θ^t)
    = Q(θ | θ^t) − Q(θ^t | θ^t) + Σy P(y | x, θ^t) log [ P(y | x, θ^t) / P(y | x, θ) ]
    ≥ Q(θ | θ^t) − Q(θ^t | θ^t),

since the last sum is a relative entropy and hence non-negative. Therefore choosing

θ^(t+1) = argmaxθ Q(θ | θ^t)

guarantees log P(x | θ^(t+1)) ≥ log P(x | θ^t).

E-step (Expectation): compute P(y | x, θ^t) and thus Q(θ | θ^t).
M-step (Maximization): θ^(t+1) = argmaxθ Q(θ | θ^t).

EM explanation of the Baum-Welch algorithm

We would like to choose θ to maximize the likelihood

P(x | θ) = Σπ P(x, π | θ),

but the state path π is a hidden variable. Thus, EM, with

Q(θ | θ^t) = Σπ P(π | x, θ^t) log P(x, π | θ).

EM explanation of the Baum-Welch algorithm (continued)

Writing log P(x, π | θ) in terms of the transition and emission parameters splits Q(θ | θ^t) into an A-term (transitions) and an E-term (emissions), weighted by the expected counts Akl and Ek(b) defined earlier.

The A-term is maximized if

  akl^EM = Akl / Σl′ Akl′

and the E-term is maximized if

  ek^EM(b) = Ek(b) / Σb′ Ek(b′).

These are exactly the Baum-Welch re-estimation formulas.

GLOBEX Bioinformatics

(Summer 2015)

Hidden Markov Models (IV)

a. Profile HMMs

b. ipHMMs

c. GENSCAN

d. TMMOD

Interaction profile HMM (ipHMM): Friedrich et al., Bioinformatics 2006.

GENSCAN (generalized HMMs)

• Chris Burge, PhD thesis '97, Stanford

• http://genes.mit.edu/GENSCAN.html

• Four components:

– A vector π of initial probabilities

– A matrix T of state transition probabilities

– A set of length distributions f

– A set of sequence-generating models P

• Generalized HMMs:

– At each state, the emission is not a single symbol (or residue) but a fragment of sequence.

– A modified Viterbi algorithm is used for decoding.

• Initial state probabilities

– Set to the frequency of each functional unit in actual genomic data. E.g., since ~80% of the genome is non-coding intergenic region, the initial probability for the intergenic state N is 0.80.

• Transition probabilities

• State length distributions

• Training data

– 2.5 Mb of human genomic sequence

– 380 genes, 142 single-exon genes, 1492 exons, and 1254 introns

– 1619 cDNAs

Open areas for research

• Model building

– Integration of domain knowledge, such as structural information, into profile HMMs

– Meta learning?

• Biological mechanisms, e.g., DNA replication

• Hybrid models

– Generalized HMMs

– …

TMMOD: an improved hidden Markov model for predicting transmembrane topology

[State diagram (after TMHMM): on the cytoplasmic side, globular and loop (cyt) states with caps feed into the helix core; on the non-cytoplasmic side, caps lead to loop, long-loop, and globular states; the structure is mirrored across the membrane.]

TMHMM: Krogh, A. et al., JMB 305 (2001) 567-580. Accuracy of topology prediction: 78%.

[Figure: histograms of the number of loops vs. loop length (0-100 residues), for cytoplasmic (cyto) and non-cytoplasmic (acyto) loops.]

Results (correct topology and correct location are counts of proteins out of 83 or 160, with percentages):

Model    | Reg. | Data set | Correct topology | Correct location | Sensitivity | Specificity
TMMOD 1  | (a)  | S-83     | 65 (78.3%)       | 67 (80.7%)       | 97.4%       | 97.4%
TMMOD 1  | (b)  | S-83     | 51 (61.4%)       | 52 (62.7%)       | 71.3%       | 71.3%
TMMOD 1  | (c)  | S-83     | 64 (77.1%)       | 65 (78.3%)       | 97.1%       | 97.1%
TMMOD 2  | (a)  | S-83     | 61 (73.5%)       | 65 (78.3%)       | 99.4%       | 97.4%
TMMOD 2  | (b)  | S-83     | 54 (65.1%)       | 61 (73.5%)       | 93.8%       | 71.3%
TMMOD 2  | (c)  | S-83     | 54 (65.1%)       | 66 (79.5%)       | 99.7%       | 97.1%
TMMOD 3  | (a)  | S-83     | 70 (84.3%)       | 71 (85.5%)       | 98.2%       | 97.4%
TMMOD 3  | (b)  | S-83     | 64 (77.1%)       | 65 (78.3%)       | 95.3%       | 71.3%
TMMOD 3  | (c)  | S-83     | 74 (89.2%)       | 74 (89.2%)       | 99.1%       | 97.1%
TMHMM    |      | S-83     | 64 (77.1%)       | 69 (83.1%)       | 96.2%       | 96.2%
PHDtm    |      | S-83     | (85.5%)          | (88.0%)          | 98.8%       | 95.2%
TMMOD 1  | (a)  | S-160    | 117 (73.1%)      | 128 (80.0%)      | 97.4%       | 97.0%
TMMOD 1  | (b)  | S-160    | 92 (57.5%)       | 103 (64.4%)      | 77.4%       | 80.8%
TMMOD 1  | (c)  | S-160    | 117 (73.1%)      | 126 (78.8%)      | 96.1%       | 96.7%
TMMOD 2  | (a)  | S-160    | 120 (75.0%)      | 132 (82.5%)      | 98.4%       | 97.2%
TMMOD 2  | (b)  | S-160    | 97 (60.6%)       | 121 (75.6%)      | 97.7%       | 95.6%
TMMOD 2  | (c)  | S-160    | 118 (73.8%)      | 135 (84.4%)      | 98.4%       | 97.2%
TMMOD 3  | (a)  | S-160    | 120 (75.0%)      | 133 (83.1%)      | 97.8%       | 97.6%
TMMOD 3  | (b)  | S-160    | 110 (68.8%)      | 124 (77.5%)      | 94.5%       | 98.1%
TMMOD 3  | (c)  | S-160    | 135 (84.4%)      | 143 (89.4%)      | 98.3%       | 98.1%
TMHMM    |      | S-160    | 123 (76.9%)      | 134 (83.8%)      | 97.1%       | 97.7%
