CSE182-L10 HMM applications


Page 1: CSE182-L10

CSE182-L10

HMM applications

Page 2: CSE182-L10

Probability of being in specific states

• What is the probability that we were in state k at step i?

$$\frac{\Pr[\text{all paths that passed through state } k \text{ at step } i \text{ and emitted } x]}{\Pr[\text{all paths that emitted } x]} = \frac{\Pr[x, \pi_i = k]}{\Pr[x]}$$

Page 3: CSE182-L10

The Forward Algorithm

• Recall v[i,j]: the probability of the most likely path the automaton chose in emitting x1…xi and ending up in state j.

• Define f[i,j]: the probability that the automaton started from state 1, emitted x1…xi, and ended up in state j.

• What is the difference?

Page 4: CSE182-L10

Most Likely path versus Probability of Arrival

• There are multiple paths from state 1 to state j along which the automaton can output x1…xi.

• In computing the Viterbi path, we choose the most likely of these paths:
$$V[i,j] = \max_{\pi:\, \pi_i = j} \Pr[x_1 \ldots x_i, \pi]$$

• The probability of emitting x1…xi and ending up in state j sums over all of these paths:
$$F[i,j] = \sum_{\pi:\, \pi_i = j} \Pr[x_1 \ldots x_i, \pi]$$

Page 5: CSE182-L10

The Forward Algorithm

• Recall that
$$v(i,j) = \max_{l \in Q} \{ v(i-1,l) \cdot A[l,j] \} \cdot e_j(x_i)$$

• Instead
$$F(i,j) = \sum_{l \in Q} \big( F(i-1,l) \cdot A[l,j] \big) \cdot e_j(x_i)$$
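The contrast between the two recurrences can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the two-state model (`STATES`, `START`, `A`, `E`) and the observed string `X` are hypothetical, and a start distribution stands in for the slides' "started from state 1".

```python
def forward(x, states, start, A, e):
    """F[i][j]: probability of emitting x[0..i], summed over ALL paths ending in j."""
    F = [{j: start[j] * e[j][x[0]] for j in states}]
    for i in range(1, len(x)):
        F.append({
            j: sum(F[i - 1][l] * A[l][j] for l in states) * e[j][x[i]]
            for j in states
        })
    return F


def viterbi(x, states, start, A, e):
    """V[i][j]: probability of the single BEST path emitting x[0..i] and ending in j."""
    V = [{j: start[j] * e[j][x[0]] for j in states}]
    for i in range(1, len(x)):
        V.append({
            j: max(V[i - 1][l] * A[l][j] for l in states) * e[j][x[i]]
            for j in states
        })
    return V


# Hypothetical toy model (all numbers are assumptions chosen for illustration).
STATES = ("s1", "s2")
START = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
E = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
X = "aba"

F = forward(X, STATES, START, A, E)
V = viterbi(X, STATES, START, A, E)
```

The only difference between the two functions is `sum` versus `max`, so every entry of `V` is at most the matching entry of `F`. For long sequences the products underflow; a practical version would work in log space or rescale each column.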

Page 6: CSE182-L10

The Backward Algorithm

• Define b[i,j]: the probability that the automaton, being in state j at step i, emitted xi+1…xn and ended up in the final state.

Page 7: CSE182-L10

Forward Backward Scoring

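• $F(i,j) = \sum_{l \in Q} \big( F(i-1,l) \cdot A[l,j] \big) \cdot e_j(x_i)$

• $B(i,j) = \sum_{l \in Q} A[j,l] \cdot e_l(x_{i+1}) \cdot B(i+1,l)$

• $\Pr[x, \pi_i = k] = F(i,k) \cdot B(i,k)$

$$\frac{\Pr[x, \pi_i = k]}{\Pr[x]} = \frac{F(i,k)\, B(i,k)}{\sum_{k'} F(i,k')\, B(i,k')}$$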
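Forward-backward scoring can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy model (`STATES`, `START`, `A`, `E`, `X`) is hypothetical, and instead of an explicit final state the backward table is initialised with B(n,j) = 1 for every state j, which means any state may be the last one.

```python
def forward(x, states, start, A, e):
    """F[i][j]: probability of emitting x[0..i] and being in state j at step i."""
    F = [{j: start[j] * e[j][x[0]] for j in states}]
    for i in range(1, len(x)):
        F.append({
            j: sum(F[i - 1][l] * A[l][j] for l in states) * e[j][x[i]]
            for j in states
        })
    return F


def backward(x, states, A, e):
    """B[i][j]: probability of emitting x[i+1..] given state j at step i."""
    n = len(x)
    B = [None] * n
    B[n - 1] = {j: 1.0 for j in states}  # assumption: no explicit end state
    for i in range(n - 2, -1, -1):
        B[i] = {
            j: sum(A[j][l] * e[l][x[i + 1]] * B[i + 1][l] for l in states)
            for j in states
        }
    return B


def posterior(x, states, start, A, e):
    """post[i][k] = F(i,k) * B(i,k) / sum over k' of F(i,k') * B(i,k')."""
    F = forward(x, states, start, A, e)
    B = backward(x, states, A, e)
    post = []
    for i in range(len(x)):
        total = sum(F[i][k] * B[i][k] for k in states)  # equals Pr[x] at every i
        post.append({k: F[i][k] * B[i][k] / total for k in states})
    return post


# Hypothetical toy model (all numbers are assumptions chosen for illustration).
STATES = ("s1", "s2")
START = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
E = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
X = "abba"

POST = posterior(X, STATES, START, A, E)
```

A useful sanity check is that the denominator is the same at every step i, since each one is just Pr[x] split by the state occupied at that step.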

Page 8: CSE182-L10

Application of HMMs

• How do we modify this to handle indels?

     1    2    3    4    5    6    7    8
A  0.9  0.4  0.3  0.6  0.1  0.0  0.2  1.0
C  0.0  0.2  0.7  0.0  0.3  0.0  0.0  0.0
G  0.1  0.2  0.0  0.0  0.3  1.0  0.3  0.0
T  0.0  0.2  0.0  0.4  0.3  0.0  0.5  0.0
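To see why indels matter, it may help to score sequences against the matrix with no gaps allowed. The sketch below copies the emission probabilities from the slide; the function name `profile_probability` and the test sequence `AGCAGGTA` are made up for illustration.

```python
# Emission probabilities from the slide: one row per base, one entry per column 1..8.
PROFILE = {
    "A": [0.9, 0.4, 0.3, 0.6, 0.1, 0.0, 0.2, 1.0],
    "C": [0.0, 0.2, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0],
    "G": [0.1, 0.2, 0.0, 0.0, 0.3, 1.0, 0.3, 0.0],
    "T": [0.0, 0.2, 0.0, 0.4, 0.3, 0.0, 0.5, 0.0],
}
PROFILE_LENGTH = 8


def profile_probability(seq):
    """Probability of seq when base i must be emitted by column i (no indels)."""
    if len(seq) != PROFILE_LENGTH:
        raise ValueError("without indels the sequence must have exactly 8 bases")
    p = 1.0
    for column, base in enumerate(seq):
        p *= PROFILE[base][column]
    return p


print(profile_probability("AGCAGGTA"))  # a sequence that fits every column
print(profile_probability("ACACTGTA"))  # 0.0: column 4 never emits C
```

The second call returns zero because a single mismatching column rules the sequence out, and sequences of any other length cannot be scored at all. Insertion and deletion states are what relax both restrictions.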

Page 9: CSE182-L10

Applications of the HMM paradigm

• Modifying Profile HMMs to handle indels

• States Ii: insertion states

• States Di: deletion states

[Figure: the profile emission matrix from Page 8 (rows A, C, G, T; columns 1 to 8).]

Page 10: CSE182-L10

Profile HMMs

• An assignment of states implies insertion, match, or deletion. EX: ACACTGTA

[Figure: the profile emission matrix from Page 8 (rows A, C, G, T; columns 1 to 8), with the bases of an example sequence assigned to states.]

Page 11: CSE182-L10

Viterbi Algorithm revisited

• Define $v^M_j(i)$ as the log-likelihood score of the best path for matching x1..xi to the profile HMM, ending with xi emitted by the state Mj.

• $v^I_j(i)$ and $v^D_j(i)$ are defined similarly.

Page 12: CSE182-L10

Viterbi Equations for Profile HMMs

$$v^M_j(i) = \log e_{M_j}(x_i) + \max \begin{cases} v^M_{j-1}(i-1) + \log A[M_{j-1}, M_j] \\ v^I_{j-1}(i-1) + \log A[I_{j-1}, M_j] \\ v^D_{j-1}(i-1) + \log A[D_{j-1}, M_j] \end{cases}$$

$$v^I_j(i) = \log e_{I_j}(x_i) + \max \begin{cases} v^M_j(i-1) + \log A[M_j, I_j] \\ v^I_j(i-1) + \log A[I_j, I_j] \\ v^D_j(i-1) + \log A[D_j, I_j] \end{cases}$$
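The slides give no recurrence for the deletion states, saying only that they are defined similarly. In the standard profile-HMM formulation, which is assumed here rather than taken from the slides, a deletion state emits no symbol, so there is no emission term and the sequence index i does not advance:

$$v^D_j(i) = \max \begin{cases} v^M_{j-1}(i) + \log A[M_{j-1}, D_j] \\ v^I_{j-1}(i) + \log A[I_{j-1}, D_j] \\ v^D_{j-1}(i) + \log A[D_{j-1}, D_j] \end{cases}$$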

Page 13: CSE182-L10

Compositional Signals

• CpG islands. In genomic sequence, the CG dinucleotide is rarely seen.

• CG helps methylation of C, and subsequent mutation to T.

• In regions around a gene, the methylation is suppressed, and therefore CG is more common.

• CpG islands: Islands of CG on the genome.

• How can you detect CpG islands?

Page 14: CSE182-L10

An HMM for Genomic regions

• Node A emits A with Prob. 1, and 0 for all other bases.

• The start and end node do not emit any symbol.

• All outgoing edges from nodes are equi-probable, except for the ones coming out of C.

[Figure: HMM with a start node, an end node, and one node per base (C, A, T, G); edge labels shown: 0.1, 0.4, 0.25, 0.25.]

Page 15: CSE182-L10

An HMM for CpG islands

• Node A emits A with Prob. 1, and 0 for all other bases.

• The start and end node do not emit any symbol.

• All outgoing edges from nodes are equi-probable, except for the ones coming out of C.

[Figure: HMM with a start node, an end node, and one node per base (C, A, T, G); edge labels shown: 0.25, 0.25, 0.25.]

Page 16: CSE182-L10

HMM for detecting CpG Islands

• In the best parse of a genomic sequence, each base is assigned a state from one of two sets, A and B.

• Any substring with multiple states coming from B can be described as a CpG island.

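[Figure: two copies of the base-level HMM, labeled A and B, each with a start node, an end node, and one node per base (C, A, T, G); edge labels 0.1 and 0.4 appear in one copy.]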
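The parsing idea can be sketched as a Viterbi pass over two copies of the base-level model. Everything numeric below is an assumption made for illustration, not a value read off the slides: the two sets are named `background` and `island` instead of A and B, the island copy uses 0.25 on every edge, the background copy uses C→G = 0.1 and 0.3 for the other edges out of C, the probability of switching copies is 0.1, and the end node is ignored.

```python
import math

BASES = "ACGT"
REGIONS = ("background", "island")

# Hypothetical base-to-base transition probabilities inside each copy of the model.
BASE_TRANS = {r: {b: {c: 0.25 for c in BASES} for b in BASES} for r in REGIONS}
BASE_TRANS["background"]["C"] = {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3}

# Hypothetical probabilities of staying in a copy versus switching to the other.
SWITCH = {
    "background": {"background": 0.9, "island": 0.1},
    "island": {"background": 0.1, "island": 0.9},
}


def label_regions(seq):
    """Most likely region label for each base (Viterbi in log space)."""
    if not seq:
        return []
    # Each node emits its own base with probability 1, so at step i only the
    # two states (seq[i], background) and (seq[i], island) are possible.
    score = {r: math.log(0.5 * 0.25) for r in REGIONS}
    back = []
    for prev, cur in zip(seq, seq[1:]):
        new_score, pointers = {}, {}
        for r in REGIONS:
            best = max(REGIONS, key=lambda q: score[q] + math.log(SWITCH[q][r]))
            new_score[r] = (score[best] + math.log(SWITCH[best][r])
                            + math.log(BASE_TRANS[r][prev][cur]))
            pointers[r] = best
        back.append(pointers)
        score = new_score
    # Trace back from the best final region.
    region = max(REGIONS, key=lambda r: score[r])
    labels = [region]
    for pointers in reversed(back):
        region = pointers[region]
        labels.append(region)
    return labels[::-1]


SEQ = "CA" * 20 + "CG" * 10 + "CA" * 20
LABELS = label_regions(SEQ)
```

With these made-up parameters the CG-rich middle of `SEQ` is labeled `island` and the flanks `background`. The two bases at the region boundaries are ties and can fall either way, and short CG runs do not pay for the two switches, so the switching probability effectively sets a minimum island length.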

Page 17: CSE182-L10

HMM: Summary

• HMMs are a natural technique for modeling many biological domains.

• They can capture position-dependent as well as compositional properties.

• HMMs have been very useful in an important Bioinformatics application: gene finding.