Hidden Markov Models for biological sequence analysis II Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Master in Bioinformatics UPF 2014-2015


Hidden Markov Models for biological sequence

analysis II

Eduardo Eyras Computational Genomics

Pompeu Fabra University - ICREA Barcelona, Spain

Master in Bioinformatics UPF 2014-2015


HMM model structure

Duration modelling

[Figure: a single state with a self-transition of probability p and an exit transition of probability 1-p]

p = probability of the transition from the state to itself; 1-p = probability of leaving the state.

Probability of staying in the state for n residues: $P(n) = (1-p)\,p^{n-1}$. This decays exponentially with n (geometric distribution).

How to avoid this decay? For instance, by using several states with the same emission probabilities and transitions between them.

E.g. [Figure: chain of states] a model for sequences of minimum length 5, with exponentially decaying probability for longer ones.

Duration modelling

[Figure: HMM state diagram]

E.g. this model can represent any distribution of lengths between 2 and 5.

Duration modelling

[Figure: array of n states, each with self-transition probability p and probability 1-p of moving on to the next state]

This type of array of n states can model sequences of length n or longer.

For a single path of length m, the product of the transition probabilities is

$$ p^{m-n} (1-p)^{n} $$

Summing over all possible paths of length m gives

$$ P(m) = \binom{m-1}{n-1}\, p^{m-n} (1-p)^{n} $$

where we use the binomial coefficients

$$ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} $$
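The length distribution of such an array of states can be checked numerically. The following is a minimal sketch, assuming the formula P(m) = C(m-1, n-1) p^(m-n) (1-p)^n given above; the function name `length_probability` is a hypothetical choice for illustration.

```python
from math import comb


def length_probability(m, n, p):
    """Probability that an array of n states, each with self-transition
    probability p, generates a sequence of exactly m residues."""
    if m < n:
        return 0.0  # every state must emit at least one residue
    # comb(m - 1, n - 1) counts the paths of length m through the n states
    return comb(m - 1, n - 1) * p ** (m - n) * (1 - p) ** n


if __name__ == "__main__":
    # With n = 1 the formula reduces to the geometric distribution (1-p) p^(m-1)
    for m in range(1, 6):
        print(m, length_probability(m, 1, 0.5))
    # With n = 4 the distribution is no longer monotonically decaying
    for m in range(4, 30, 5):
        print(m, round(length_probability(m, 4, 0.8), 4))
```

For n = 1 the values decay geometrically, while for larger n the probability is zero below length n and peaks at an intermediate length, which is the behaviour the array of states is meant to provide.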

Profile-HMMs

Finding distant members of a protein family

A distant cousin of the functionally related sequences in a protein family may have only weak pairwise similarity with each member of the family, and may therefore be missed by standard pairwise methods (e.g. BLAST).

Because such a sequence may nevertheless show weak similarities with many members of the family, the goal is to align it to all members of the family at once.

A family of related proteins can be represented by its multiple alignment and a corresponding profile. Can we represent the profile as a probabilistic model?

Profile-HMM

We use a multiple alignment to build a profile-HMM.

•  It is an HMM: a probabilistic representation of a multiple alignment, so we can use the same HMM algorithms (Viterbi, etc.).
•  We can add position-dependent gap penalties (to model gaps in the alignment).
•  We can add variable states with position-dependent random emission probabilities (to model variable regions).

This model may then be used to find and score less obvious potential matches of new protein sequences.

The profile-HMM is used to ask whether a new sequence S belongs to a given model (e.g. a given family of proteins, or the set of proteins containing a given domain).

Profiles and HMMs

A protein family can be represented by a profile of the amino acid frequencies at each position. E.g. a multiple alignment of SH3 domains:

[Figure: multiple alignment of SH3 domains]

Profile representation of protein families

Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of the nucleotides at every aligned position.

Aligned sequences spanning an exon-intron boundary (exon first, then intron):

    CAGGTACCC
    GAGGTGAGA
    CTGGTGAGG
    TAGGTGAGT
    CAGGTCTGT
    CTGGTGAGC
    CAGGTAAGT

Profile matrix:

    pos  1     2     3     4     5     6     7     8     9
    A    0     0.71  0     0     0     0.28  0.71  0     0.14
    C    0.71  0     0.28  0     0     0.14  0.14  0.14  0.28
    G    0.14  0     0.71  1     0     0.57  0     0.85  0.14
    T    0.14  0.28  0     0     1     0     0.14  0     0.42

E.g. position 1: P(C) = frequency = 5/7 = 0.71

Position Specific Scoring Matrix (PSSM, also called PWM):

$$ S = \sum_{i=1}^{L} \log \frac{e_i(s_i)}{q_{s_i}} $$

where $e_i(s_i)$ are the motif probabilities (the profile frequency of residue $s_i$ at position i) and $q_{s_i}$ are the background probabilities.
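The profile and the PSSM score can be sketched in a few lines of Python. This is an illustrative sketch only: the frequencies are recomputed from the seven sequences as they are listed above (so individual entries may differ from the rounded table), the uniform background and the optional pseudocount are assumptions, and the names `profile` and `log_odds_score` are hypothetical.

```python
import math

ALPHABET = "ACGT"
SEQS = [
    "CAGGTACCC", "GAGGTGAGA", "CTGGTGAGG", "TAGGTGAGT",
    "CAGGTCTGT", "CTGGTGAGC", "CAGGTAAGT",
]


def profile(seqs, pseudocount=0.0):
    """Column-wise nucleotide frequencies of equal-length aligned sequences."""
    columns = []
    for i in range(len(seqs[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        columns.append({a: counts[a] / total for a in ALPHABET})
    return columns


def log_odds_score(seq, prof, background):
    """S = sum_i log(e_i(s_i) / q_{s_i}); -inf if a residue has frequency 0."""
    score = 0.0
    for i, residue in enumerate(seq):
        e = prof[i][residue]
        if e == 0.0:
            return float("-inf")
        score += math.log(e / background[residue])
    return score


if __name__ == "__main__":
    prof = profile(SEQS)
    uniform = {a: 0.25 for a in ALPHABET}  # assumed background
    print(round(prof[0]["C"], 2))          # frequency of C at position 1
    print(log_odds_score("CAGGTGAGT", prof, uniform))
```

Without pseudocounts, any sequence containing a residue never observed in a column scores minus infinity, which is one practical reason to add pseudocounts before taking logarithms.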

Profile-HMM: Match states

The conserved regions can be modelled as in a PSSM.

[Figure: chain of Match states M_j between a begin state and an end state]

A PSSM can be viewed as a trivial HMM with a chain of states of the same type, the Match states, separated by transitions of probability 1.

$$ \mathrm{Score} = \sum_{i=1}^{L} \log \frac{e_{M_i}(s_i)}{q_{s_i}} $$

where $e_{M_i}(a)$, the emission probability in Match state $M_i$, is the frequency of amino acid a in the corresponding alignment column, and $q_a$ are the background probabilities.

Profile-HMMs: Insertion states

A multiple alignment of a protein family shows variations in conservation along the length of the protein. E.g. a multiple alignment of SH3 domains:

[Figure: multiple alignment of SH3 domains]

Conserved regions can be described by PWMs, but variable regions cannot!

Profile-HMMs: Insertion states

[Figure: chain of Match states M_i between Start and End, with Insertion states I_i]

We treat insertions and deletions separately.

Insertion: portions of the query sequence S that do not match anything in the model; we must insert residues with respect to the model.

Insertion state $I_i$: models insertions after the residue matching the i-th column of the alignment.

$e_{I_i}(a) = p(a)$: emission probability in an Insertion state = amino acid frequency in all sequences (background).

Profile-HMMs: Insertion states

[Figure: chain of Match states M_i between Start and End, with Insertion states I_i]

Transitions from $I_i$ to itself model multiple insertions.

Emissions from $I_i$ contribute no log-odds (log-likelihood ratio) score, since the insertion emission probabilities equal the background probabilities.

Score of a gap of length k:

$$ \log a_{M_i I_i} + (k-1) \log a_{I_i I_i} + \log a_{I_i M_{i+1}} $$

The three terms are the gap opening penalty, the gap extension penalty and the gap closing penalty, respectively.

Gap penalties are position-dependent! Compare to e.g. Needleman-Wunsch, where the same gap penalties apply at every position.
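The affine form of the insertion score can be made explicit with a short sketch. It assumes the three transition probabilities of one position are given as plain numbers; the function name `insertion_gap_score` and the example values are hypothetical.

```python
import math


def insertion_gap_score(a_MI, a_II, a_IM, k):
    """Log score of an insertion of k residues after match position i.

    a_MI: probability of M_i -> I_i     (gap opening)
    a_II: probability of I_i -> I_i     (gap extension)
    a_IM: probability of I_i -> M_{i+1} (gap closing)
    """
    if k < 1:
        raise ValueError("an insertion contains at least one residue")
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)


if __name__ == "__main__":
    # Illustrative values only: each extra inserted residue adds log(a_II)
    for k in (1, 2, 3):
        print(k, insertion_gap_score(0.05, 0.4, 0.6, k))
```

Each additional inserted residue changes the score by the same amount, log a_II, because the extension always uses the same state; only the values of the three probabilities differ from position to position.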

Profile-HMMs: Silent states

Deletions: segments of the model that are not matched by any residue in the query sequence S. That is, when fitting S to the model we need to skip Match states: we must allow deletions in the query sequence.

One possibility to allow for deletions is to connect non-neighbouring states:

[Figure: Match states with transitions between non-neighbouring states]

This is too complex for modelling arbitrary deletions in a long sequence.

We therefore introduce the silent states $D_j$ to model deletions.

Profile-HMMs: Silent states

We can model arbitrary deletions by connecting the states to a parallel chain of silent states (circles):

[Figure: chain of Match states M_j between Start and End, with a parallel chain of silent states D_j]

It is possible to get from any "real" state to any "real" state without emitting letters.

Cost of a deletion of length k:

$$ \log a_{M_i D_{i+1}} + \sum_{j=i+1}^{i+k-1} \log a_{D_j D_{j+1}} + \log a_{D_{i+k} M_{i+k+1}} $$

The deletion extension terms have different probabilities (each one involves different states), whereas the insertion extension terms all make the same contribution (same state).
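The cost of a deletion can be sketched in the same way as the insertion score. This sketch assumes the transition probabilities are stored in dictionaries indexed by the position of the source state; the names `deletion_score`, `a_MD`, `a_DD` and `a_DM`, and the numbers in the example, are hypothetical.

```python
import math


def deletion_score(a_MD, a_DD, a_DM, i, k):
    """Log score of skipping k match states after M_i via D_{i+1}..D_{i+k}.

    a_MD[j]: probability of M_j -> D_{j+1}
    a_DD[j]: probability of D_j -> D_{j+1}
    a_DM[j]: probability of D_j -> M_{j+1}
    """
    score = math.log(a_MD[i])            # open the deletion
    for j in range(i + 1, i + k):        # k - 1 position-specific extensions
        score += math.log(a_DD[j])
    score += math.log(a_DM[i + k])       # return to a match state
    return score


if __name__ == "__main__":
    a_MD = {1: 0.1}
    a_DD = {2: 0.3, 3: 0.4}
    a_DM = {2: 0.7, 3: 0.6, 4: 0.5}
    for k in (1, 2, 3):
        print(k, deletion_score(a_MD, a_DD, a_DM, 1, k))
```

Unlike the insertion case, each extension step uses a different pair of states, so the increments between successive values of k are not constant.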

States in a profile-HMM

[Figure: profile-HMM state diagram with Start and End states and, at each position i, a Match state M_i, an Insertion state I_i and a Deletion state D_i]

•  Match states: conserved positions in the alignment (plus start/end states)
•  Insertion states: variable regions (not clearly alignable)
•  Deletion states: model gaps in the alignment

Building a profile-HMM

[Figure: profile-HMM state diagram with Start, End, Match (M_i), Insertion (I_i) and Deletion (D_i) states]

We want to build a model representing the consensus of a family of sequences, not the sequence of any particular member.

Building a profile-HMM

[Figure: profile-HMM state diagram with Start, End, Match (M_i), Insertion (I_i) and Deletion (D_i) states]

A multiple alignment is used to construct the HMM:

•  Assign each aligned (conserved) column to a Match state (M) in the HMM; this determines the length of the model.
•  Estimate the emission probabilities according to the amino acid counts in the columns. Different positions in the protein will have different emission probabilities.
•  Add Insertion (I) and Deletion (D) states; their connectivity remains to be determined.
•  Estimate the transition probabilities between Match, Deletion and Insertion states.

Probabilities in a profile-HMM

[Figure: profile-HMM state diagram with Start, End, Match (M_i), Insertion (I_i) and Deletion (D_i) states]

•  $e_{M_i}(a)$: emission probability in a Match state = frequency of each amino acid in the alignment column
•  $e_{I_i}(a) = p(a)$: emission probability in an Insertion state = amino acid frequency in all sequences (background)
•  $a_{M_i I_i}$: transition probability from a Match state to an Insertion state; $\log a_{M_i I_i}$ is the gap opening penalty
•  $a_{I_i I_i}$: transition probability from an Insertion state to itself; $\log a_{I_i I_i}$ is the gap extension penalty
•  $a_{D_i D_{i+1}}$: transition probability between consecutive Deletion states

Probabilities in a profile-HMM

[Figure: profile-HMM state diagram with Start, End, Match (M_i), Insertion (I_i) and Deletion (D_i) states]

•  $a_{D_i I_i} = 0$: transition probability from a Deletion state to an Insertion state
•  $a_{I_i D_{i+1}} = 0$: transition probability from an Insertion state to a Deletion state

These transitions are usually very improbable.

How to assign the states?

[Figure: example multiple alignment and the corresponding profile-HMM states]

Heuristic rules:

•  Denote as insertion states the columns of the alignment that contain gaps in more than half of the sequences. Denote as match states the conserved columns with fewer gaps.
•  Alternatively, calculate the entropy of each column and denote as insertion states the columns with a high degree of disorder.

In the example, all columns are M states except the 4th and 5th, which are I states.
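The gap-fraction rule can be sketched as follows. The alignment used here is an assumption: a small illustrative alignment chosen so that its 4th and 5th columns are mostly gaps and its column counts agree with the numbers of the parameter estimation example; it is not taken from the slides. The function name `assign_columns` and the threshold parameter are hypothetical.

```python
# Illustrative alignment (assumed, not from the slides); "-" marks a gap
ALIGNMENT = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]


def assign_columns(alignment, max_gap_fraction=0.5):
    """Label each column 'I' if more than max_gap_fraction of its
    symbols are gaps, and 'M' otherwise."""
    n_seqs = len(alignment)
    labels = []
    for c in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[c] == "-")
        labels.append("I" if gaps / n_seqs > max_gap_fraction else "M")
    return labels


if __name__ == "__main__":
    print("".join(assign_columns(ALIGNMENT)))
```

With this alignment only columns 4 and 5 exceed the one-half threshold, so the resulting model has eight Match states.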

Profile-HMM parameter estimation

We start from a given sample of alignments. We can estimate the parameters by counting the transitions and emissions:

•  $A_{kl}$: number of transitions between states k and l
•  $E_k(b)$: number of times the symbol b is emitted by state k

We can estimate the probabilities as follows:

$$ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} $$

This estimate is accurate for a large number of sequences.

To avoid overfitting, use pseudocounts:

$$ A_{kl} \to A_{kl} + r_{kl}, \qquad E_k(b) \to E_k(b) + r_k(b) $$

Pseudocounts reflect our prior knowledge.
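The counting estimates, with and without pseudocounts, can be sketched with exact fractions. This is a minimal sketch assuming a single constant pseudocount for every outcome (the general pseudocounts $r_{kl}$ and $r_k(b)$ may differ per outcome); the function name `estimate` is hypothetical, and the counts used below are those of the first match state in the worked example of the following slide.

```python
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def estimate(counts, pseudocount=0):
    """Turn a dict of counts into probabilities, adding the same
    pseudocount to every outcome before normalising."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {k: Fraction(c + pseudocount, total) for k, c in counts.items()}


if __name__ == "__main__":
    # Emission counts: 5 V, 1 F and 1 I observed in the column
    emissions = {a: 0 for a in AMINO_ACIDS}
    emissions.update({"V": 5, "F": 1, "I": 1})
    print(estimate(emissions)["V"], estimate(emissions, 1)["V"])

    # Transition counts out of the state: 6 to M, 1 to D, 0 to I
    transitions = {"M": 6, "D": 1, "I": 0}
    print(estimate(transitions), estimate(transitions, 1))
```

Using `Fraction` keeps the results exact, which makes it easy to confirm that each distribution sums to 1.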

Parameter estimation: Example

[Figure: example multiple alignment]

Emission probabilities of $M_1$:

$$ e_{M_1}(V) = 5/7, \qquad e_{M_1}(F) = e_{M_1}(I) = 1/7 $$

Using pseudocounts:

$$ e_{M_1}(V) = (5+1)/(7+20) = 6/27, \qquad e_{M_1}(F) = e_{M_1}(I) = (1+1)/(7+20) = 2/27 $$

and $e_{M_1}(a) = 1/27$ for all other amino acids.

Transition probabilities from $M_1$:

$$ a_{M_1 M_2} = 6/7, \qquad a_{M_1 D_2} = 1/7, \qquad a_{M_1 I_1} = 0 $$

Using pseudocounts:

$$ a_{M_1 M_2} = (6+1)/(7+3) = 7/10, \qquad a_{M_1 D_2} = (1+1)/(7+3) = 2/10, \qquad a_{M_1 I_1} = (0+1)/(7+3) = 1/10 $$

Parameter estimation: Example

[Figure: example multiple alignment]

Emission probabilities of $M_3$:

$$ e_{M_3}(A) = 3/6, \qquad e_{M_3}(G) = 2/6 $$

Using pseudocounts:

$$ e_{M_3}(A) = (3+1)/(6+20) = 4/26, \qquad e_{M_3}(G) = (2+1)/(6+20) = 3/26 $$

Transition probabilities from $M_3$:

$$ a_{M_3 M_4} = 4/6, \qquad a_{M_3 D_4} = 1/6, \qquad a_{M_3 I_3} = 1/6 $$

Using pseudocounts:

$$ a_{M_3 M_4} = (4+1)/(6+3) = 5/9, \qquad a_{M_3 D_4} = (1+1)/(6+3) = 2/9, \qquad a_{M_3 I_3} = (1+1)/(6+3) = 2/9 $$

Transition probabilities from $D_3$:

$$ a_{D_3 M_4} = 1/1, \qquad a_{D_3 D_4} = 0 $$

Using pseudocounts:

$$ a_{D_3 M_4} = (1+1)/(1+2) = 2/3, \qquad a_{D_3 D_4} = (0+1)/(1+2) = 1/3 $$

Always check normalization! Here we removed the D->I transition, so only two pseudocounts are added for $D_3$.
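The normalization check can be written out explicitly for the numbers of this example. The emission line assumes that the one remaining observed residue in the column (6 observations, of which 3 are A and 2 are G) belongs to a single other amino acid, which the slide does not state.

```latex
\begin{align*}
a_{M_3M_4} + a_{M_3D_4} + a_{M_3I_3} &= \tfrac{4}{6} + \tfrac{1}{6} + \tfrac{1}{6} = 1
  && \text{(no pseudocounts)} \\
a_{M_3M_4} + a_{M_3D_4} + a_{M_3I_3} &= \tfrac{5}{9} + \tfrac{2}{9} + \tfrac{2}{9} = 1
  && \text{(pseudocounts)} \\
a_{D_3M_4} + a_{D_3D_4} &= \tfrac{2}{3} + \tfrac{1}{3} = 1
  && \text{(pseudocounts, D} \to \text{I removed)} \\
\sum_{a} e_{M_3}(a) &= \tfrac{4}{26} + \tfrac{3}{26} + \tfrac{2}{26} + 17 \cdot \tfrac{1}{26} = 1
  && \text{(pseudocounts, 20 amino acids)}
\end{align*}
```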

Searching with Profile-HMMs

Profile-HMMs can be used to detect a possible new member of a sequence family.

We must compare the new sequence against the profile-HMM.

[Figure: profile-HMM state diagram with Start, End, Match (M_i), Insertion (I_i) and Deletion (D_i) states]

Searching with Profile-HMMs

We can use Viterbi to obtain the most probable path $\pi^*$ across the model and then calculate its probability:

$$ P(S \mid \pi^*) $$

We can use Forward to obtain the total probability for the sequence given the model:

$$ P(S) = P(s_1 \dots s_L) = \sum_{\pi} P(s_1 \dots s_L, \pi_0 \dots \pi_N) $$

In general we use log-likelihood ratios (log-odds) with respect to a background model.

Profile HMM Viterbi

•  $V^M_j(i)$: best score (log-likelihood ratio) for the best path of states aligning the subsequence $s_1 \dots s_i$ to the submodel up to state j, ending in the emission of $s_i$ by $M_j$
•  $V^I_j(i)$: best score for the best path ending at $s_i$ being emitted by $I_j$
•  $V^D_j(i)$: best score for the best path ending at $D_j$

Profile HMM Viterbi

$$ V^M_j(i) = \log \frac{e_{M_j}(s_i)}{q_{s_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1} M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1} M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1} M_j} \end{cases} $$

$$ V^I_j(i) = \log \frac{e_{I_j}(s_i)}{q_{s_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases} $$

$$ V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1} D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1} D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1} D_j} \end{cases} $$

Deletion states are silent: entering $D_j$ does not consume a residue, so the sequence index i is unchanged in the recurrence for $V^D_j(i)$.

Profile HMM Viterbi

In the Viterbi recurrences for $V^M_j(i)$, $V^I_j(i)$ and $V^D_j(i)$:

•  The emission term of $V^I_j(i)$, $\log \frac{e_{I_j}(s_i)}{q_{s_i}}$, does not contribute in general, since $e_{I_j}(s_i) = q_{s_i}$.
•  The terms with the transitions $a_{D_j I_j}$ and $a_{I_{j-1} D_j}$ are usually not present (they are negligible when scoring an alignment to the model).

Profile HMM Viterbi

Initialisation:

The start state is $M_0$, such that $V^M_0(0) = 0$.

We allow transitions to $I_0$ and $D_1$.

Termination:

The end state is $M_{L+1}$.

We allow the alignment to end in a deletion or insertion state:

$$ \mathrm{Score}(S \mid \pi^*) = \max \begin{cases} V^M_L(n) + \log a_{M_L,\mathrm{end}} \\ V^I_L(n) + \log a_{I_L,\mathrm{end}} \\ V^D_L(n) + \log a_{D_L,\mathrm{end}} \end{cases} $$

Here n is the length of the sequence S and L is the number of Match states in the model.
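The recurrences, initialisation and termination can be put together in a short log-odds Viterbi sketch. Several things here are assumptions made for illustration: the data layout (`match_emit`, `trans`, `background`), the helper names, the toy two-state DNA model and its numbers, and the choice to return only the best score without a traceback. Insertion emissions are taken to equal the background, so they add 0 to the score, and the deletion recurrence uses the same sequence index i because deletion states are silent.

```python
import math

NEG_INF = float("-inf")


def log0(x):
    """Natural log that maps a probability of 0 to -infinity."""
    return math.log(x) if x > 0 else NEG_INF


def viterbi_score(seq, match_emit, trans, background):
    """Best log-odds score of seq against a profile-HMM (no traceback).

    match_emit[j][a]: emission probability of a in M_j (index 0 unused)
    trans[(X, j, Y)]: probability of leaving state X_j towards a state of
                      type Y: "M" -> M_{j+1} (End if j == L), "I" -> I_j,
                      "D" -> D_{j+1}; missing keys count as probability 0
    background[a]:    background probability q_a
    """
    L, n = len(match_emit) - 1, len(seq)

    def lt(x, j, y):
        return log0(trans.get((x, j, y), 0.0))

    VM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0  # initialisation: start state M_0

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:  # M_j emits s_i
                s = seq[i - 1]
                VM[j][i] = log0(match_emit[j][s] / background[s]) + max(
                    VM[j - 1][i - 1] + lt("M", j - 1, "M"),
                    VI[j - 1][i - 1] + lt("I", j - 1, "M"),
                    VD[j - 1][i - 1] + lt("D", j - 1, "M"),
                )
            if i >= 1:  # I_j emits s_i with log-odds 0
                VI[j][i] = max(
                    VM[j][i - 1] + lt("M", j, "I"),
                    VI[j][i - 1] + lt("I", j, "I"),
                    VD[j][i - 1] + lt("D", j, "I"),
                )
            if j >= 1:  # D_j is silent: index i is not advanced
                VD[j][i] = max(
                    VM[j - 1][i] + lt("M", j - 1, "D"),
                    VI[j - 1][i] + lt("I", j - 1, "D"),
                    VD[j - 1][i] + lt("D", j - 1, "D"),
                )

    # termination: transition into the End state from M_L, I_L or D_L
    return max(
        VM[L][n] + lt("M", L, "M"),
        VI[L][n] + lt("I", L, "M"),
        VD[L][n] + lt("D", L, "M"),
    )


# Toy model with two match states over DNA (all numbers are made up)
TOY_BACKGROUND = {a: 0.25 for a in "ACGT"}
TOY_MATCH = [
    None,
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
TOY_TRANS = {
    ("M", 0, "M"): 0.9, ("M", 0, "I"): 0.05, ("M", 0, "D"): 0.05,
    ("I", 0, "I"): 0.5, ("I", 0, "M"): 0.5,
    ("M", 1, "M"): 0.9, ("M", 1, "I"): 0.05, ("M", 1, "D"): 0.05,
    ("I", 1, "I"): 0.5, ("I", 1, "M"): 0.5,
    ("D", 1, "M"): 0.9, ("D", 1, "D"): 0.1,
    ("M", 2, "M"): 0.9, ("M", 2, "I"): 0.1,
    ("I", 2, "I"): 0.5, ("I", 2, "M"): 0.5,
    ("D", 2, "M"): 1.0,
}

if __name__ == "__main__":
    for s in ("AC", "A", "ACG"):
        print(s, viterbi_score(s, TOY_MATCH, TOY_TRANS, TOY_BACKGROUND))
```

Leaving the D->I and I->D transitions out of `TOY_TRANS` gives them probability 0, so the corresponding terms drop out of the maximisation, as discussed for the recurrences.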

Making a collection of Profile-HMMs for protein families

•  Use BLAST to separate a protein database into families of related proteins.
•  Construct a multiple alignment for each protein family.
•  Construct a profile-HMM and optimize the parameters of the model (transition and emission probabilities).
•  Align the target sequence against each HMM to find the best fit between the target sequence and an HMM.

PFAM

•  Pfam describes protein domains (http://pfam.sanger.ac.uk/)
•  Each protein domain family in Pfam has:
    - Seed alignment: manually verified multiple alignment of a representative set of sequences.
    - HMM built from the seed alignment for further database searches.
    - Full alignment generated automatically from the HMM.
•  The distinction between seed and full alignments facilitates Pfam updates:
    - Seed alignments are stable resources.
    - HMM profiles and full alignments can be updated with newly found amino acid sequences.
•  Pfam HMMs span entire domains that include both well-conserved motifs and less-conserved regions with insertions and deletions.
•  Modelling complete domains facilitates better sequence annotation and leads to more sensitive detection.
