
Hidden Markov models and their application to bioinformatics

Overview

Hidden Markov models (HMMs)
– A problem in bioinformatics solved by HMMs
  • Multiple alignments
– Standard algorithms for HMMs
  • The Viterbi algorithm: finding the most likely state transition
  • The Baum-Welch algorithm: learning parameters from data

Problem in this talk (1/2)

Finding conserved regions of proteins (amino acid sequences)
– Different species can have (almost) the same functional proteins that share common amino acid subsequences.

[Figure: an ancestral amino acid sequence (mgpg..) diverges over time into the sequences of present-day species (mgdv.., mgdv.., mgpv..).]

Problem in this talk (2/2)

Finding conserved regions of amino acid sequences
– Different species can have common subsequences of amino acid sequences in the same functional protein.
  • Some amino acids were changed in the evolutionary process.

Example) Amino acid sequences of cytochrome c, a protein that transfers electrons in aerobic (oxygen) respiration.

Goal: find common parts in multiple sequences efficiently.

Problem to be solved

Common subsequences = conserved because they are functionally important?

human  mgdvekgkki fimkcsqcht vekggkhktg pnlhglfgrk ...
mouse  mgdvekgkki fvqkcaqcht vekggkhktg pnlhglfgrk ...
fly    mgvpagdvek gkklfvqrca qchtveaggk hkvgpnlhgl ...

Comparison of sequences

Arrange the sequences so that the number of matched characters is maximized.
– Align sequences by inserting gaps "-".
– This process is called alignment.

Simply listed sequences

mgdvekgkki fimkcsqcht..

mgdvekgkki fvqkcaqcht..

mgvpagdvek gkklfvqrca..

Sequences with gaps

mg----dvek gkkifimkcsqcht..

mg----dvek gkkifvqkcaqcht..

mgvpagdvek gkklfvqrca..

Approaches to alignments

N: # of sequences, L: max. sequence length, M: # of states in an HMM.

                      Best alignment            Computation time   # of sequences in practice
Dynamic programming   Can be found              O(2^N L^N)         Only a few
HMMs                  Often found in practice   O(N M^2 L)         Dozens of sequences applicable

[Plot: computation time vs. N (# of sequences); DP time grows far faster than HMM time.]

Hidden Markov models

We move from state to state probabilistically, and the Markov property holds:
– The next state transition and the output symbol depend only on the current state.

Example of HMM

[Figure: a 3-state HMM. Each state has outgoing transition probabilities (e.g., 0.7, 0.1, 0.2 out of state 1; 0.5, 0.4, 0.1 out of another state) and output probabilities (e.g., a: 0.2, b: 0.8 for state 1; a: 0.6, b: 0.4 for state 2).]

[Figure: time course of one run; the model moves through states (e.g., 1 → 3 → 2 → 1) and outputs one symbol per step (e.g., a, b, b).]
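To make this concrete, here is a minimal sketch (not from the original slides) of how such an HMM can be represented and run. The probability values are filled in from the figures as far as they could be recovered, so treat them as assumptions.

```python
import numpy as np

# A 3-state HMM over the alphabet {a, b}; values taken loosely from the figures.
init = np.array([0.2, 0.3, 0.5])            # initial state probabilities
trans = np.array([[0.7, 0.1, 0.2],          # transitions out of state 1
                  [0.1, 0.4, 0.5],          # transitions out of state 2
                  [0.5, 0.4, 0.1]])         # transitions out of state 3
emit = np.array([[0.2, 0.8],                # state 1: P(a), P(b)
                 [0.6, 0.4],                # state 2: P(a), P(b)
                 [0.4, 0.6]])               # state 3: P(a), P(b)
symbols = "ab"

def sample(length, seed=0):
    """Generate one run: a state sequence and its output symbols."""
    rng = np.random.default_rng(seed)
    states, out = [], []
    s = rng.choice(3, p=init)
    for _ in range(length):
        states.append(s + 1)                # 1-based state labels, as in the figure
        out.append(symbols[rng.choice(2, p=emit[s])])
        s = rng.choice(3, p=trans[s])
    return states, "".join(out)

print(sample(3))   # one (state sequence, output symbols) pair
```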

Alignment with HMM

[Idea]
– A state of the HMM = a position in the alignment.
– Example

[Figure: two sequences "gar..." and "g-k..." follow the state paths 1 → 2 → 3 and 1 → 3; the column pairing r (arginine) with k (lysine) aligns similar amino acids.]

Arginine and Lysine

[Figure: chemical structures of arginine, lysine, serine, and glycine. Arginine and lysine both have long, positively charged side chains, which is why they count as similar amino acids; serine and glycine are shown for contrast.]


An HMM can describe similar amino acids with its probabilities and states; such an HMM for alignments is called a profile HMM.

Profile HMM

Each state represents a difference (insertion, deletion) of symbols from a base sequence.

[Figure: a profile HMM with match states m0–m3, insert states i0–i2, and delete states d1–d2.]
– Match state: a symbol is output.
– Insert state: a symbol is output.
– Delete state: no symbol is output.

Example) Base sequence A-KVG aligned with another sequence ASR-G: S is an inserted symbol, the base's V is a deleted symbol, and the remaining symbols are matched.

Alignment with HMM

State transitions correspond to alignments.

Alignment:
A - K V G
A - R V G
A S R - G

State transitions:
A-KVG: m0 m1 m2 m3 (outputs A K V G)
A-RVG: m0 m1 m2 m3 (outputs A R V G)
ASR-G: m0 i0 m1 d2 m3 (outputs A S R G; d2 outputs nothing)
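As an illustration of this correspondence, the following sketch (my own, with a hypothetical helper name) converts each row of the alignment above into its profile-HMM state path, assuming the match columns are exactly the columns where the base sequence A-KVG has a residue.

```python
def state_path(row, base="A-KVG"):
    """Map one alignment row to a profile-HMM state path.

    Columns where the base sequence has a residue are match columns
    (match state m or delete state d); columns where it has '-' are
    insert columns (insert state i).
    """
    path, m = [], 0                      # m: index of the next match column
    for b, c in zip(base, row):
        if b != "-":                     # match column of the profile
            path.append(f"m{m}" if c != "-" else f"d{m}")
            m += 1
        elif c != "-":                   # insert column holding a residue
            path.append(f"i{m - 1}")
        # b == '-' and c == '-': an empty column, no state is visited
    return path

for row in ["A-KVG", "A-RVG", "ASR-G"]:
    print(row, "->", state_path(row))
# A-KVG -> ['m0', 'm1', 'm2', 'm3']
# A-RVG -> ['m0', 'm1', 'm2', 'm3']
# ASR-G -> ['m0', 'i0', 'm1', 'd2', 'm3']
```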

Overview

Hidden Markov models (HMMs)
– A problem in bioinformatics solved by HMMs
  • Multiple alignments
– Standard algorithms for HMMs
  • The Viterbi algorithm: finding the most likely state transition
  • The Baum-Welch algorithm: learning parameters from data

Prediction of the best state transition

Algorithm for prediction:
– Input: multiple sequences (AKVG, ARVG, ASRG) and a hidden Markov model.
– Output: the state transition that maximizes the probability.

[Figure: the profile HMM (m0–m3, i0–i2, d1–d2) and the resulting state transitions: m0 m1 m2 m3 for A K V G; m0 m1 m2 m3 for A R V G; m0 i0 m1 d2 m3 for A S R G.]

Enumeration

Compute the probability of every possible state transition.

Example) Sequence aba with the HMM below:
– States 1, 1, 1: 0.2 · 0.2 (a) · 0.7 · 0.8 (b) · 0.7 · 0.2 (a) ≈ 0.0031
– States 1, 1, 2: 0.2 · 0.2 (a) · 0.7 · 0.8 (b) · 0.1 · 0.6 (a) ≈ 0.0013
– States 1, 1, 3: 0.2 · 0.2 (a) · 0.7 · 0.8 (b) · 0.2 · 0.4 (a) ≈ 0.0018
– ... and so on for every state sequence.

[Figure: the 3-state HMM with its transition probabilities and output probabilities (a: 0.2, b: 0.8 for state 1; a: 0.6, b: 0.4 for state 2; a: 0.4, b: 0.6 for state 3, as used in the computation above).]

There are (#states)^length state transitions, so this is impossible to compute in practice.
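A naive sketch of this enumeration, reusing the assumed matrices from the earlier snippet; it visits all 3^L state sequences, which is exactly why it does not scale.

```python
from itertools import product
import numpy as np

init = np.array([0.2, 0.3, 0.5])
trans = np.array([[0.7, 0.1, 0.2],
                  [0.1, 0.4, 0.5],
                  [0.5, 0.4, 0.1]])
emit = np.array([[0.2, 0.8],
                 [0.6, 0.4],
                 [0.4, 0.6]])
obs = [0, 1, 0]                      # the sequence "aba" (a=0, b=1)

best_p, best_path = 0.0, None
for path in product(range(3), repeat=len(obs)):   # all 3^L state sequences
    p = init[path[0]] * emit[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1], path[t]] * emit[path[t], obs[t]]
    if p > best_p:
        best_p, best_path = p, path

print([s + 1 for s in best_path], best_p)   # best path (1-based) and its probability
```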

To find the best state transition

N: # of sequences, L: max. sequence length, M: # of states in the HMM.

              State trans. with max. prob.   Comp. time    Length of seqs. in practice
Enumeration   Can be found                   O(N M^L)      Short only
Viterbi       Can be found                   O(N M^2 L)    Longer sequences applicable

[Plot: computation time vs. L (length of sequences); enumeration time grows far faster than Viterbi time.]

Viterbi algorithm (1/4)

[Idea]: Combine state transitions.
– Transition probabilities and output probabilities are independent of past states.

[Figure: a trellis with states 1–3 as rows and the symbols a, b, a as columns; paths that reach the same state at the same step can be combined.]

Viterbi algorithm (2/4)

(1) Compute, for each state and each step, the probability of the best state transition reaching it.

[Figure: the trellis for the symbols a, b, a, with initial probabilities 0.2, 0.3, 0.5 for states 1, 2, 3. At each step, each state keeps only the maximum over its incoming transitions, multiplied by the output probability of the observed symbol; the per-state values shown are (0.08, 0.09, 0.10) after the first symbol, (0.050, 0.018, 0.030) after the second, and (0.010, 0.011, 0.012) after the third.]

Viterbi algorithm (3/4)

(2) Trace back the state transition with the maximum probability.

[Figure: the same trellis; starting from the largest final value (0.012), the stored best predecessors are followed backwards to recover the most likely state sequence.]

Viterbi algorithm (4/4)

Computing time: O(NM^2L)
– At each node, it takes O(M) time.
– The number of nodes for a sequence is O(ML).

[Figure: the same trellis as before.]
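Here is a compact sketch of the Viterbi algorithm as just described: dynamic programming over the trellis, keeping one best predecessor per node, followed by a traceback. The matrices are the assumed ones from the earlier snippets.

```python
import numpy as np

def viterbi(obs, init, trans, emit):
    """Most likely state sequence for one observation sequence.

    obs: list of symbol indices; runs in O(M^2 L) time.
    """
    M, L = len(init), len(obs)
    v = np.zeros((L, M))        # v[i, k]: best prob. of any path ending in k at step i
    ptr = np.zeros((L, M), dtype=int)   # back-pointers to the best predecessor
    v[0] = init * emit[:, obs[0]]
    for i in range(1, L):
        for k in range(M):
            scores = v[i - 1] * trans[:, k]     # incoming transitions into state k
            ptr[i, k] = np.argmax(scores)
            v[i, k] = scores[ptr[i, k]] * emit[k, obs[i]]
    # Trace back from the most probable final state.
    path = [int(np.argmax(v[-1]))]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1], float(v[-1].max())

init = np.array([0.2, 0.3, 0.5])
trans = np.array([[0.7, 0.1, 0.2],
                  [0.1, 0.4, 0.5],
                  [0.5, 0.4, 0.1]])
emit = np.array([[0.2, 0.8],
                 [0.6, 0.4],
                 [0.4, 0.6]])
path, p = viterbi([0, 1, 0], init, trans, emit)   # the sequence "aba"
print([s + 1 for s in path], p)                   # states are 1-based in the slides
```

This returns the same best path as the brute-force enumeration above, but in O(M^2 L) time instead of O(M^L).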

Appropriate parameters

When the parameters change, the best state transition can also change.

[Figure: the same trellis, but with changed output probabilities (e.g., 0.2, 0.8, 0.2); one of the final values becomes 0.004, so a different path becomes the best one.]

Overview

Hidden Markov models (HMMs)
– A problem in bioinformatics solved by HMMs
  • Multiple alignments
– Standard algorithms for HMMs
  • The Viterbi algorithm: finding the most likely state transition
  • The Baum-Welch algorithm: learning parameters from data

Learning (training) algorithm

Algorithm to find appropriate parameters
– Baum-Welch algorithm
  • An instance of the EM (Expectation-Maximization) family of algorithms.

[Flowchart of the Baum-Welch algorithm: set the initial parameters at random; update the parameters; while the increase of the likelihood P(X|θ) is large enough, repeat the update; otherwise output the parameters.]

Each update always increases the likelihood P(X|θ), where X is the set of sequences. The expected numbers of occurrences of the parameters are used for the update.

Appropriate parameters

Probabilistic parameters
– Better parameters: the given data is generated with higher probability.

Example) Casting a die 30 times gives the observed counts:

Spots:   1   2   3   4   5   6
Count:  10   0  10   0  10   0

Prob. parameters (1): every spot has probability 1/6.
Prob. parameters (2): spots 1, 3, 5 have probability 1/3; spots 2, 4, 6 have probability 0.

Probability of observing the above 30 casts: (1/6)^30 under (1) < (1/3)^30 under (2), so parameters (2) fit the data better.

# occurrences of parameters

In updating the transition probability a_{k,l}:
– Count the occurrences of the transition from state k to state l.

[Figure: several state transitions through a sequence x_1 ... x_L, each passing through the transition k → l; their probabilities (e.g., 0.003, 0.004, ...) add up to A_{k,l} (e.g., 1.244), the expectation of the number of occurrences of the transition.]

The updated value of the parameter is

    a'_{k,l} = A_{k,l} / Σ_{l'} A_{k,l'}
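In code, this update is just a row-wise normalization of the expected-count matrix; a minimal sketch, with count values made up for illustration:

```python
import numpy as np

# A[k, l]: expected number of occurrences of the transition k -> l,
# accumulated over all training sequences (values made up here).
A = np.array([[1.244, 0.532, 0.891],
              [0.310, 2.005, 0.770],
              [1.110, 0.450, 0.200]])

# a'[k, l] = A[k, l] / sum over l' of A[k, l']   (row-wise normalization)
a_new = A / A.sum(axis=1, keepdims=True)
print(a_new.round(3))   # each row now sums to 1
```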

Computing # occurrences of parameters

Expectation of the number of occurrences of the state transition from state k to state l:

[Figure: a path outputs x_1 ... x_i ending in state k, takes the transition k → l (probability a_{k,l}), outputs x_{i+1} in state l (probability e_l(x_{i+1})), and continues with x_{i+2} ... x_L. The prefix is covered by the forward probability f_k(i) and the suffix by the backward probability b_l(i+1).]

Summing over all positions i:

    A_{k,l} = (1 / P(X)) · Σ_i f_k(i) · a_{k,l} · e_l(x_{i+1}) · b_l(i+1)

Forward probability

f_k(i): the forward probability, i.e., the total probability that the model outputs x_1 ... x_i and is in state k at step i.

[Figure: every path ending in state k at step i passes through some state at step i-1, so the values at step i-1 can be summed.]

    f_k(i) = e_k(x_i) · ( f_1(i-1) · a_{1,k} + ... + f_M(i-1) · a_{M,k} )
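A direct sketch of this recurrence (same assumed matrices as before); note that summing the last row of f gives the likelihood P(X) of the whole sequence.

```python
import numpy as np

def forward(obs, init, trans, emit):
    """f[i, k] = P(x_1 .. x_i, state at step i is k).  O(M^2 L) per sequence."""
    M, L = len(init), len(obs)
    f = np.zeros((L, M))
    f[0] = init * emit[:, obs[0]]
    for i in range(1, L):
        # f_k(i) = e_k(x_i) * sum_j f_j(i-1) * a_{j,k}, computed for all k at once
        f[i] = emit[:, obs[i]] * (f[i - 1] @ trans)
    return f

init = np.array([0.2, 0.3, 0.5])
trans = np.array([[0.7, 0.1, 0.2],
                  [0.1, 0.4, 0.5],
                  [0.5, 0.4, 0.1]])
emit = np.array([[0.2, 0.8],
                 [0.6, 0.4],
                 [0.4, 0.6]])
f = forward([0, 1, 0], init, trans, emit)
print(f[-1].sum())   # P("aba") under the assumed parameters
```

For long sequences, real implementations work in log space or rescale each column to avoid numerical underflow.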

Backward probability

b_k(i): the backward probability, i.e., the total probability that the model outputs x_{i+1} ... x_L given that it is in state k at step i.

[Figure: every path leaving state k at step i passes through some state at step i+1, so the values at step i+1 can be summed.]

    b_k(i) = a_{k,1} · e_1(x_{i+1}) · b_1(i+1) + ... + a_{k,M} · e_M(x_{i+1}) · b_M(i+1)
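The symmetric sketch for the backward probability; the recursion runs from the end of the sequence toward the front.

```python
import numpy as np

def backward(obs, trans, emit):
    """b[i, k] = P(x_{i+1} .. x_L | state at step i is k).  O(M^2 L) per sequence."""
    M, L = trans.shape[0], len(obs)
    b = np.zeros((L, M))
    b[L - 1] = 1.0                  # nothing remains to be output after step L
    for i in range(L - 2, -1, -1):
        # b_k(i) = sum_l a_{k,l} * e_l(x_{i+1}) * b_l(i+1), for all k at once
        b[i] = trans @ (emit[:, obs[i + 1]] * b[i + 1])
    return b

# With init/trans/emit as in the forward snippet, P(X) can also be recovered as
# (init * emit[:, obs[0]] * backward(obs, trans, emit)[0]).sum(),
# which should match forward(obs, init, trans, emit)[-1].sum().
```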

Forward/Backward probabilities

Time to compute them for one sequence:

[Figure: the forward recurrence again; computing one value f_k(i) sums over M predecessor states, i.e., O(M) time.]

f_k(i) has M choices of k and L choices of i, and each value takes O(M) time, so for each sequence the computation takes O(M^2 L) time.
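Putting the pieces together, here is a sketch of how the expected number of transition occurrences A_{k,l} from the earlier slide can be computed with the forward and backward probabilities; it assumes the forward() and backward() functions defined in the previous snippets are in scope.

```python
import numpy as np

def expected_transitions(obs, init, trans, emit):
    """A[k, l]: expected number of k -> l transitions in one sequence.

    Uses forward() and backward() from the previous snippets.
    """
    f = forward(obs, init, trans, emit)
    b = backward(obs, trans, emit)
    px = f[-1].sum()                          # P(X), the sequence likelihood
    A = np.zeros_like(trans)
    for i in range(len(obs) - 1):
        # A_{k,l} += f_k(i) * a_{k,l} * e_l(x_{i+1}) * b_l(i+1) / P(X)
        A += trans * np.outer(f[i], emit[:, obs[i + 1]] * b[i + 1]) / px
    return A

# Summing A over all training sequences and then row-normalizing it
# (as in the earlier update formula) gives the Baum-Welch update of
# the transition probabilities.
```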

Conclusion

Alignment of multiple sequences
– To find conserved regions of protein sequences.

Hidden Markov models (HMMs)
– Profile HMM
  • Describes alignments of multiple sequences.
– Prediction algorithm
  • The Viterbi algorithm.
– Learning (training) algorithm
  • The Baum-Welch algorithm.
    – For efficiency, the forward and backward probabilities are used.