CSE182-L10


HMM applications

Probability of being in specific states

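• What is the probability that we were in state k at step i, given that x was emitted?

$$\Pr[\pi_i = k \mid x] = \frac{\Pr[\text{all paths that pass through state } k \text{ at step } i \text{ and emit } x]}{\Pr[\text{all paths that emit } x]} = \frac{\Pr[x, \pi_i = k]}{\Pr[x]}$$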

The Forward Algorithm

• Recall v[i,j]: the probability of the most likely path by which the automaton emits x1…xi and ends up in state j.

• Define f[i,j]: the probability that the automaton started from state 1, emitted x1…xi, and ended up in state j.

• What is the difference?


Most Likely path versus Probability of Arrival

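• There are multiple paths from state 1 to state j along which the automaton can output x1…xi.

• In computing the Viterbi path, we choose the most likely of these paths: $V[i,j] = \max_{\pi} \Pr[x_1 \ldots x_i, \pi]$

• The probability of emitting x1…xi and ending up in state j sums over all of them: $F[i,j] = \sum_{\pi} \Pr[x_1 \ldots x_i, \pi]$

In both expressions, π ranges over the paths that start in state 1 and are in state j at step i.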

The Forward Algorithm

• Recall that $v(i,j) = \max_{l \in Q} \{ v(i-1,l) \cdot A[l,j] \} \cdot e_j(x_i)$

• Instead, $F(i,j) = \sum_{l \in Q} \big( F(i-1,l) \cdot A[l,j] \big) \cdot e_j(x_i)$
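The two recurrences differ only in whether the previous column is combined with a max or a sum. The following Python sketch illustrates that point on a hypothetical two-state HMM. The names `STATES`, `START`, `A`, `E` and every number in them are assumptions made up for illustration, and an explicit start distribution stands in for the slides' designated start state.

```python
# Illustrative two-state HMM; all numbers are assumptions, not from the lecture.
STATES = [0, 1]
START = [0.5, 0.5]                      # Pr[first state = j] (assumed)
A = [[0.9, 0.1],                        # A[l][j] = Pr[next state j | state l]
     [0.2, 0.8]]
E = [{"H": 0.5, "T": 0.5},              # E[j][symbol] = e_j(symbol)
     {"H": 0.9, "T": 0.1}]


def fill_table(x, combine):
    """Fill table[i][j] for the prefix x[0..i] ending in state j.

    combine=max yields the Viterbi table, combine=sum the forward table.
    """
    table = [[START[j] * E[j][x[0]] for j in STATES]]
    for i in range(1, len(x)):
        prev = table[-1]
        row = [combine(prev[l] * A[l][j] for l in STATES) * E[j][x[i]]
               for j in STATES]
        table.append(row)
    return table


def viterbi_table(x):
    return fill_table(x, max)


def forward_table(x):
    return fill_table(x, sum)


def sequence_probability(x):
    # Pr[x] is the last forward row summed over all possible end states.
    return sum(forward_table(x)[-1])
```

Because a sum of non-negative terms is never smaller than their maximum, every forward entry is at least as large as the corresponding Viterbi entry.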


The Backward Algorithm

• Define b[i,j]: the probability that the automaton, being in state j at step i, emits xi+1…xn and ends up in the final state.


Forward Backward Scoring

• $F(i,j) = \sum_{l \in Q} \big( F(i-1,l) \cdot A[l,j] \big) \cdot e_j(x_i)$

• $B(i,j) = \sum_{l \in Q} A[j,l] \cdot e_l(x_{i+1}) \cdot B(i+1,l)$

• $\Pr[x, \pi_i = k] = F(i,k) \cdot B(i,k)$

$$\frac{\Pr[x, \pi_i = k]}{\Pr[x]} = \frac{F(i,k)\, B(i,k)}{\sum_{k'} F(i,k')\, B(i,k')}$$
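The forward and backward tables combine into a posterior state probability for each position. A minimal Python sketch follows, using a hypothetical two-state HMM whose parameters are invented for illustration. It also assumes there is no explicit end state, so the backward value at the last position is 1 for every state, which is a simplification of the "final state" in the definition of the backward table.

```python
# Illustrative two-state HMM; all numbers are assumptions, not from the lecture.
STATES = [0, 1]
START = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]            # A[j][l] = Pr[next state l | state j]
E = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]


def forward(x):
    f = [[START[j] * E[j][x[0]] for j in STATES]]
    for i in range(1, len(x)):
        f.append([sum(f[i - 1][l] * A[l][j] for l in STATES) * E[j][x[i]]
                  for j in STATES])
    return f


def backward(x):
    n = len(x)
    # Assumption: no explicit end state, so the last row is all ones.
    b = [[1.0 for _ in STATES] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for j in STATES:
            b[i][j] = sum(A[j][l] * E[l][x[i + 1]] * b[i + 1][l]
                          for l in STATES)
    return b


def posterior(x):
    """post[i][k] = Pr[state k at step i | x] = F(i,k) B(i,k) / Pr[x]."""
    f, b = forward(x), backward(x)
    post = []
    for i in range(len(x)):
        joint = [f[i][k] * b[i][k] for k in STATES]
        px = sum(joint)                 # equals Pr[x] at every position i
        post.append([p / px for p in joint])
    return post
```

A useful sanity check is that the sum of F(i,k) B(i,k) over k gives the same value, Pr[x], at every position i.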

Application of HMMs

• How do we modify this to handle indels?

     1    2    3    4    5    6    7    8
A   0.9  0.4  0.3  0.6  0.1  0.0  0.2  1.0
C   0.0  0.2  0.7  0.0  0.3  0.0  0.0  0.0
G   0.1  0.2  0.0  0.0  0.3  1.0  0.3  0.0
T   0.0  0.2  0.0  0.4  0.3  0.0  0.5  0.0
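To see why indels matter, it helps to score a sequence against the profile with no gaps allowed. The Python sketch below copies the emission probabilities from the 8-column profile. The function names `ungapped_probability` and `consensus` are hypothetical helpers introduced only for this illustration.

```python
# Emission probabilities of the 8-column profile; list index 0 is column 1.
PROFILE = {
    "A": [0.9, 0.4, 0.3, 0.6, 0.1, 0.0, 0.2, 1.0],
    "C": [0.0, 0.2, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0],
    "G": [0.1, 0.2, 0.0, 0.0, 0.3, 1.0, 0.3, 0.0],
    "T": [0.0, 0.2, 0.0, 0.4, 0.3, 0.0, 0.5, 0.0],
}


def ungapped_probability(seq):
    """Probability that the profile emits seq with base i in column i (no indels)."""
    if len(seq) != len(PROFILE["A"]):
        raise ValueError("an ungapped match needs exactly one base per column")
    prob = 1.0
    for column, base in enumerate(seq):
        prob *= PROFILE[base][column]
    return prob


def consensus():
    """Most probable base in each column (ties broken in the order A, C, G, T)."""
    length = len(PROFILE["A"])
    return "".join(max("ACGT", key=lambda b: PROFILE[b][col])
                   for col in range(length))
```

For instance, `ungapped_probability("ACACTGTA")` is 0 because column 4 never emits C, so this sequence can only be explained by the profile if some bases are treated as insertions or some columns are skipped.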

Applications of the HMM paradigm

• Modifying Profile HMMs to handle indels

• States Ii: insertion states

• States Di: deletion states


Profile HMMs

• An assignment of states implies insertion, match, or deletion. Example: ACACTGTA

[Figure: the profile matrix (rows A, C, G, T; columns 1-8) with the bases of the example sequence assigned to match, insertion, and deletion states.]

Viterbi Algorithm revisited

• Define $v^M_j(i)$ as the log-likelihood score of the best path for matching x1..xi to the profile HMM, ending with xi emitted by the state Mj.

• $v^I_j(i)$ and $v^D_j(i)$ are defined similarly.

Viterbi Equations for Profile HMMs

$$v^M_j(i) = \log e_{M_j}(x_i) + \max \begin{cases} v^M_{j-1}(i-1) + \log A[M_{j-1}, M_j] \\ v^I_{j-1}(i-1) + \log A[I_{j-1}, M_j] \\ v^D_{j-1}(i-1) + \log A[D_{j-1}, M_j] \end{cases}$$

$$v^I_j(i) = \log e_{I_j}(x_i) + \max \begin{cases} v^M_j(i-1) + \log A[M_j, I_j] \\ v^I_j(i-1) + \log A[I_j, I_j] \\ v^D_j(i-1) + \log A[D_j, I_j] \end{cases}$$
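The equations here cover only the match and insertion states. As a sketch of the remaining case, and assuming the usual profile HMM convention that deletion states are silent (so there is no emission term and the sequence index i does not advance), the deletion-state recurrence would take the following form:

```latex
v^D_j(i) = \max \begin{cases}
  v^M_{j-1}(i) + \log A[M_{j-1}, D_j] \\
  v^I_{j-1}(i) + \log A[I_{j-1}, D_j] \\
  v^D_{j-1}(i) + \log A[D_{j-1}, D_j]
\end{cases}
```

Under that assumption the predecessors are the column j-1 states evaluated at the same sequence position i, in contrast to the match recurrence, which steps back to position i-1.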

Compositional Signals

• CpG islands. In genomic sequence, the CG dinucleotide is rarely seen.

• In a CG dinucleotide, the C is prone to methylation, and a methylated C tends to mutate to T.

• In regions around a gene, methylation is suppressed, and therefore CG is more common.

• CpG islands: islands of CG on the genome. How can you detect CpG islands?

An HMM for Genomic regions

• Node A emits A with probability 1, and every other base with probability 0.

• The start and end nodes do not emit any symbol.

• All outgoing edges from a node are equiprobable, except for the ones coming out of C.

[Figure: HMM with a start node, an end node, and one node per base (A, C, G, T); visible edge labels: 0.1, 0.4, 0.25, 0.25.]

An HMM for CpG islands

• Node A emits A with probability 1, and every other base with probability 0.

• The start and end nodes do not emit any symbol.

• All outgoing edges from a node are equiprobable, except for the ones coming out of C.

[Figure: HMM with a start node, an end node, and one node per base (A, C, G, T); visible edge labels: 0.25, 0.25, 0.25.]

HMM for detecting CpG Islands

• In the best parse of a genomic sequence, each base is assigned a state from one of the two sets, A or B.

• Any substring with multiple states coming from B can be described as a CpG island.
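The best parse can be computed with the Viterbi algorithm over the combined model. The Python sketch below is one possible illustration, not the lecture's exact model. Every probability in it is an assumption: set A penalises C followed by G (0.1, with the remaining 0.9 split evenly), set B is uniform, and the chance of staying in or switching between the sets (`STAY`, `SWITCH`) is invented. Since each state emits its own base with probability 1, only the set label has to be decoded at each position.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}
# Base-to-base transition probabilities inside each set (assumed values).
WITHIN = {
    "A": {"A": UNIFORM, "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
          "G": UNIFORM, "T": UNIFORM},
    "B": {b: UNIFORM for b in BASES},
}
STAY, SWITCH = 0.8, 0.2      # assumed probability of keeping / changing set
REGIONS = "AB"


def log_transition(prev_base, prev_region, base, region):
    """log Pr[(prev_base, prev_region) -> (base, region)] under the assumptions."""
    region_p = STAY if region == prev_region else SWITCH
    return math.log(region_p) + math.log(WITHIN[region][prev_base][base])


def viterbi_regions(seq):
    """Return the most likely set label ("A" or "B") for every base of seq."""
    score = {r: math.log(0.5) for r in REGIONS}   # assumed uniform start
    back = []
    for i in range(1, len(seq)):
        new_score, pointers = {}, {}
        for r in REGIONS:
            best_prev = max(REGIONS, key=lambda p: score[p] +
                            log_transition(seq[i - 1], p, seq[i], r))
            new_score[r] = score[best_prev] + log_transition(
                seq[i - 1], best_prev, seq[i], r)
            pointers[r] = best_prev
        back.append(pointers)
        score = new_score
    region = max(REGIONS, key=lambda r: score[r])
    labels = [region]
    for pointers in reversed(back):     # trace the best path backwards
        region = pointers[region]
        labels.append(region)
    return "".join(reversed(labels))


def islands(labels):
    """Return (start, end) pairs, end exclusive, of maximal runs labelled "B"."""
    runs, start = [], None
    for i, lab in enumerate(labels + "A"):   # sentinel closes a trailing run
        if lab == "B" and start is None:
            start = i
        elif lab != "B" and start is not None:
            runs.append((start, i))
            start = None
    return runs
```

For a sequence such as `"CA" * 10 + "CG" * 8 + "CA" * 10`, the decoded labels place the CG-rich middle in set B and the flanks in set A. The exact boundary base can go either way, because the transition into the first C of the run has the same probability in both sets.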

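[Figure: two copies of the four-node (A, C, G, T) HMM, labelled A and B, each with its own start and end nodes; edge labels 0.1 and 0.4 are visible on one copy.]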

HMM: Summary

• HMMs are a natural technique for modeling many biological domains.

• They can capture position-dependent as well as compositional properties.

• HMMs have been very useful in an important bioinformatics application: gene finding.
