Hidden Markov Models for biological sequence …regulatorygenomics.upf.edu/courses/Master_AGB/4_Hidden...Hidden Markov Models for biological sequence analysis II Eduardo Eyras Computational

Hidden Markov Models for biological sequence analysis II

Eduardo Eyras Computational Genomics

Pompeu Fabra University - ICREA Barcelona, Spain

Master in Bioinformatics UPF 2014-2015

HMM model structure

Duration modelling

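[Figure: a single state with self-transition probability p and exit probability 1-p]

p = probability of a transition from the state to itself; 1-p = probability of leaving the state.

Probability of staying in the state for n residues: $P(n) = (1-p)\,p^{n-1}$, an exponentially decaying (geometric) distribution.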

How can we avoid this decay? For instance, by using several states with the same emission probabilities and transitions between them.

E.g. such a chain can model sequences of minimum length 5, with an exponentially decaying probability for longer ones.

Duration modelling

Eg: this can model any distribution of lengths between 2 and 5

[Figure: a chain of states, each with self-transition probability p and probability (1-p) of moving to the next state]

This type of array of n states can model sequences of length n or longer.

For a single path of length m, the product of the transition probabilities is

$$p^{m-n}\,(1-p)^{n}$$

Summing over all possible paths of length m gives

$$P(m) = \binom{m-1}{n-1}\, p^{m-n}\,(1-p)^{n}$$

where we use the binomial coefficients

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
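The length distribution P(m) for a chain of n states can be checked numerically. The following is a minimal sketch in Python; the function name `duration_probability` and the values n = 5, p = 0.7 are illustrative assumptions, not part of the slides.

```python
from math import comb

def duration_probability(m: int, n: int, p: float) -> float:
    """P(m) for a chain of n states, each with self-transition probability p."""
    if m < n:
        return 0.0  # the chain cannot emit fewer than n residues
    # comb(m-1, n-1) counts the paths of length m through the n states
    return comb(m - 1, n - 1) * p ** (m - n) * (1 - p) ** n

# Illustrative values (assumptions): n = 5 states, p = 0.7
n, p = 5, 0.7
probs = [duration_probability(m, n, p) for m in range(1, 400)]
print("most probable length:", probs.index(max(probs)) + 1)
print("total probability for m < 400:", round(sum(probs), 6))
```

With n = 1 the expression reduces to the geometric distribution of a single state; with n > 1 the distribution has a peak instead of decaying from the start.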

Profile - HMMs

Finding distant members of a protein family

A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus may fail to be found using standard pairwise methods (e.g. BLAST).

Even though such a sequence may have only weak similarities with many members of the family, the goal is to align it to all members of the family at once.

A family of related proteins can be represented by its multiple alignment and a corresponding profile. Can we represent the profile as a probabilistic model?

We use a multiple alignment to build a profile-HMM.

•  It is an HMM: a probabilistic representation of a multiple alignment, so we can use the same HMM algorithms (Viterbi, etc.).
•  We can add position-dependent gap penalties (to model gaps in the alignment).
•  We can add variable states with position-dependent random emission probabilities (to model variable regions).
•  The model may then be used to find and score less obvious potential matches of new protein sequences.

The profile-HMM is used to ask whether a new sequence S belongs to a given model (e.g. whether it belongs to a given family of proteins or contains a given domain).

Profile-HMM

A protein family can be represented by a profile of amino acid frequencies, e.g. one derived from a multiple alignment of SH3 domains.

Profiles and HMMs

Aligned sequences across an exon/intron boundary:

    CAGGTACCC
    GAGGTGAGA
    CTGGTGAGG
    TAGGTGAGT
    CAGGTCTGT
    CTGGTGAGC
    CAGGTAAGT

    pos   1     2     3     4     5     6     7     8     9
    A     0     0.71  0     0     0     0.28  0.71  0     0.14
    C     0.71  0     0.28  0     0     0.14  0.14  0.14  0.28
    G     0.14  0     0.71  1     0     0.57  0     0.85  0.14
    T     0.14  0.28  0     0     1     0     0.14  0     0.42

E.g. at position 1, P(C) = frequency = 5/7 = 0.71.

Profile representation of protein families

Aligned DNA sequences can be represented by a 4 × n profile matrix containing the frequency of each nucleotide at every aligned position.
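As a rough illustration, a profile matrix can be computed by counting nucleotides column by column. This is a sketch only; the names `profile_matrix` and `aligned_sites` are assumptions. Computed from the seven sequences exactly as transcribed here, position 3 comes out as G = 1, whereas the table lists C = 0.28 and G = 0.71, so the sequences or the table may have lost detail in transcription.

```python
from collections import Counter

ALPHABET = "ACGT"

def profile_matrix(sequences):
    """Return {nucleotide: [frequency in each column]} for equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    profile = {base: [] for base in ALPHABET}
    for column in zip(*sequences):  # iterate over alignment columns
        counts = Counter(column)
        for base in ALPHABET:
            profile[base].append(counts[base] / len(sequences))
    return profile

# The seven aligned sequences listed above
aligned_sites = ["CAGGTACCC", "GAGGTGAGA", "CTGGTGAGG", "TAGGTGAGT",
                 "CAGGTCTGT", "CTGGTGAGC", "CAGGTAAGT"]
profile = profile_matrix(aligned_sites)
print(round(profile["C"][0], 2))  # position 1, P(C) = 5/7 -> 0.71
```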

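Position Specific Scoring Matrix (PSSM, also PWM)

$$S = \sum_{i=1}^{L} \log \frac{e_i(s_i)}{q_{s_i}}$$

where $e_i(s_i)$ are the motif probabilities and $q_{s_i}$ are the background probabilities of the residue $s_i$.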
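The log-odds score can be sketched in a few lines of Python. The motif probabilities are copied from the profile table above; the uniform background (0.25 per nucleotide), the natural logarithm and the name `pssm_score` are assumptions made for illustration.

```python
import math

# Motif probabilities e_i(a), copied from the profile table (positions 1-9)
MOTIF = {
    "A": [0, 0.71, 0, 0, 0, 0.28, 0.71, 0, 0.14],
    "C": [0.71, 0, 0.28, 0, 0, 0.14, 0.14, 0.14, 0.28],
    "G": [0.14, 0, 0.71, 1, 0, 0.57, 0, 0.85, 0.14],
    "T": [0.14, 0.28, 0, 0, 1, 0, 0.14, 0, 0.42],
}
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumption: uniform background

def pssm_score(sequence):
    """Sum of log(e_i(s_i) / q_{s_i}) over the positions of the sequence."""
    score = 0.0
    for i, residue in enumerate(sequence):
        e = MOTIF[residue][i]
        if e == 0:
            return -math.inf  # a residue never seen at this position
        score += math.log(e / BACKGROUND[residue])
    return score

print(pssm_score("CAGGTGAGT"))  # positive: more likely under the motif
print(pssm_score("AAAAAAAAA"))  # -inf: A has frequency 0 at position 1
```

A score of minus infinity for any unseen residue is one reason why pseudocounts are added when estimating the probabilities.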

The conserved regions can be modelled as in a PSSM.

[Figure: begin → … → M_j → … → end]

A PSSM can be viewed as a trivial HMM: a chain of states of the same kind, the Match states, separated by transitions of probability 1.

$$\text{Score} = \sum_{i=1}^{L} \log \frac{e_{M_i}(s_i)}{q_{s_i}}$$

where $e_{M_i}(s_i)$ are the emission probabilities and $q_{s_i}$ are the background probabilities.

Profile-HMM: Match states

$e_{M_i}(a)$: emission probability in match state $M_i$ = frequency of amino acid a in the corresponding alignment column.

A multiple alignment of a protein family shows variation in conservation along the length of the protein, e.g. in a multiple alignment of SH3 domains.

Conserved regions can be described by PWMs, but variable regions cannot!

Profile-HMMs: Insertion states

[Figure: chain Start … M_i … End with insertion states I_i]

We treat insertions and deletions separately.

Insertion: portions of the query sequence S that do not match anything in the model; we must insert residues with respect to the model.

Insertion state: $I_i$ models insertions after the residue matching the i-th column of the alignment.

$e_{I_i}(a) = p(a)$: emission probability in an insertion state = amino acid frequency in all sequences (background).



Transitions from $I_i$ to itself model multiple insertions.

There is no log-odds (log-likelihood ratio) contribution from emissions in $I_i$, since they follow the background distribution.

Score of a gap of length k:

$$\log a_{M_i I_i} + (k-1)\,\log a_{I_i I_i} + \log a_{I_i M_{i+1}}$$

The three terms are the gap-opening penalty, the gap-extension penalty and the gap-closing penalty, respectively.

Gap penalties are position-dependent, in contrast to e.g. Needleman-Wunsch.

Profile-HMMs: Silent states

Deletions: segments of the model that are not matched by any residue in the query sequence S. That is, when fitting S to the model we need to skip match states, so we must allow deletions in the query sequence.

One possibility to allow for deletions is to connect non-neighbouring states:

Too complex to model arbitrary deletions in a long sequence

We therefore introduce the Silent states Dj to model deletions

We can model arbitrary deletions by connecting the states to a parallel chain of silent states (circles):

It is possible to get from any “real” state to any “real” state without emitting letters.

[Figure: chain Start … M_j … End with a parallel chain of silent states D_j]

Cost of a deletion of length k:

$$\log a_{M_i D_{i+1}} + \sum_{j=i+1}^{i+k-1} \log a_{D_j D_{j+1}} + \log a_{D_{i+k} M_{i+k+1}}$$

The deletion extension has a different probability at each step (different states), whereas the insertion extension contributes equally at each step (same state).
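The contrast between the two gap costs can be made concrete with a short sketch. The function names and all probability values below are hypothetical, chosen only to illustrate the formulas.

```python
import math

def insertion_score(k, a_MI, a_II, a_IM):
    """Log-score of an insertion of length k >= 1 at a single state I_i:
    one opening, k-1 identical extensions, one closing."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

def deletion_score(a_MD, a_DD, a_DM):
    """Log-score of a deletion through k delete states.
    a_DD lists the k-1 position-specific D_j -> D_{j+1} probabilities."""
    return math.log(a_MD) + sum(math.log(a) for a in a_DD) + math.log(a_DM)

# Hypothetical transition probabilities
print(insertion_score(3, a_MI=0.1, a_II=0.5, a_IM=0.4))
print(deletion_score(a_MD=0.1, a_DD=[0.3, 0.6], a_DM=0.5))  # deletion of length 3
```

Each extra inserted residue adds the same term log(a_II), while each extra deleted position adds a term that depends on where in the model the deletion occurs.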

States in a profile-HMM

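[Figure: profile-HMM architecture with Start and End states and, at each position i, a match state M_i, an insertion state I_i and a deletion state D_i]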

Match states: conserved positions in the alignment (plus start/end states)

Insertion states: variable regions (not clearly alignable)

Deletion states: model gaps in the alignment


We want to build a model representing the consensus of a family of sequences, not the sequence of any particular member.

Building a profile-HMM

A multiple alignment is used to construct the HMM:

•  Assign each aligned (conserved) column to a Match state (M) in the HMM; this determines the length of the model.
•  Estimate the emission probabilities from the amino acid counts in the columns. Different positions in the protein will have different emission probabilities.
•  Add Insertion (I) and Deletion (D) states; their connectivity is still to be determined.
•  Estimate the transition probabilities between Match, Deletion and Insertion states.

Probabilities in a profile-HMM

•  $e_{M_i}(a)$: emission probability in a match state = frequency of each amino acid in the alignment column.
•  $e_{I_i}(a) = p(a)$: emission probability in an insertion state = amino acid frequency in all sequences (background).
•  $a_{M_i I_i}$: transition probability from a match state to an insertion state; $\log a_{M_i I_i}$ is the gap-opening penalty.
•  $a_{I_i I_i}$: transition probability from an insertion state to itself; $\log a_{I_i I_i}$ is the gap-extension penalty.
•  $a_{D_i D_{i+1}}$: transition probability between consecutive deletion states.

Transitions between deletion and insertion states are usually very improbable and are set to zero:

•  $a_{D_i I_i} = 0$: transition probability from a deletion state to an insertion state.
•  $a_{I_i D_{i+1}} = 0$: transition probability from an insertion state to a deletion state.

How to assign the states?

Heuristic rules:

•  Treat as insertion states the alignment columns that contain gaps in more than half of the sequences; treat as match states the conserved columns with fewer gaps.
•  Calculate the entropy of each column and treat as insertion states the columns with a high degree of disorder.

In the example alignment, all columns will be M states except for the 4th and 5th, which will be I states.
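The gap-based rule can be sketched as follows. The toy alignment is hypothetical (the alignment figure from the slides is not available here); it was chosen so that columns 4 and 5 are mostly gaps. The name `assign_columns` and the `max_gap_fraction` parameter are likewise assumptions.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert) by its gap fraction."""
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / len(column)
        # columns with gaps in more than half of the sequences become insert columns
        labels.append("I" if gap_fraction > max_gap_fraction else "M")
    return "".join(labels)

# Hypothetical toy alignment (7 sequences, 10 columns)
toy_alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]
print(assign_columns(toy_alignment))  # MMMIIMMMMM
```

The number of 'M' labels gives the length of the model, here 8 match states.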

Profile-HMM: Parameter estimation

We start from a given sample of alignments. We can estimate the parameters by counting the transitions and emissions:

•  $A_{kl}$: number of transitions from state k to state l.
•  $E_k(b)$: number of times the symbol b is emitted by state k.

We can estimate the probabilities as follows:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

To avoid overfitting, use pseudocounts:

$$A_{kl} \rightarrow A_{kl} + r_{kl}, \qquad E_k(b) \rightarrow E_k(b) + r_k(b)$$

Pseudocounts reflect our prior knowledge.

The estimates are accurate for a large number of sequences.
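A minimal sketch of the estimation step, with and without pseudocounts, using exact fractions. The function name `estimate_probabilities` is an assumption, and a pseudocount of 1 per amino acid is used, as in the worked example that follows (counts V = 5, F = 1, I = 1 in a column of 7 residues).

```python
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def estimate_probabilities(counts, outcomes, pseudocount=0):
    """Turn raw counts into probabilities over `outcomes`, adding the same
    pseudocount to every outcome (pseudocount=0 gives the plain frequencies)."""
    total = sum(counts.get(x, 0) + pseudocount for x in outcomes)
    return {x: Fraction(counts.get(x, 0) + pseudocount, total) for x in outcomes}

column_counts = {"V": 5, "F": 1, "I": 1}
plain = estimate_probabilities(column_counts, AMINO_ACIDS)
smoothed = estimate_probabilities(column_counts, AMINO_ACIDS, pseudocount=1)

# Fraction reduces 6/27 to 2/9
print(plain["V"], smoothed["V"], smoothed["F"], smoothed["A"])  # 5/7 2/9 2/27 1/27
```

Without pseudocounts, the 17 amino acids never observed in the column get probability 0; with pseudocounts each of them gets 1/27.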

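Parameter estimation: Example

Emission probabilities for $M_1$, without pseudocounts:

$$e_{M_1}(V) = 5/7, \qquad e_{M_1}(F) = e_{M_1}(I) = 1/7$$

Using pseudocounts:

$$e_{M_1}(V) = \frac{5+1}{7+20} = \frac{6}{27}, \qquad e_{M_1}(F) = e_{M_1}(I) = \frac{1+1}{7+20} = \frac{2}{27}$$

and $e_{M_1}(a) = 1/27$ for all other amino acids a.

Transition probabilities from $M_1$, without pseudocounts:

$$a_{M_1 M_2} = 6/7, \qquad a_{M_1 D_2} = 1/7, \qquad a_{M_1 I_1} = 0$$

Using pseudocounts:

$$a_{M_1 M_2} = \frac{6+1}{7+3} = \frac{7}{10}, \qquad a_{M_1 D_2} = \frac{1+1}{7+3} = \frac{2}{10}, \qquad a_{M_1 I_1} = \frac{0+1}{7+3} = \frac{1}{10}$$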

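Parameter estimation: Example

Emission probabilities for $M_3$, without pseudocounts:

$$e_{M_3}(A) = 3/6, \qquad e_{M_3}(G) = 2/6$$

Using pseudocounts:

$$e_{M_3}(A) = \frac{3+1}{6+20} = \frac{4}{26}, \qquad e_{M_3}(G) = \frac{2+1}{6+20} = \frac{3}{26}$$

Transition probabilities from $M_3$ and $D_3$, without pseudocounts:

$$a_{M_3 M_4} = 4/6, \qquad a_{M_3 D_4} = 1/6, \qquad a_{M_3 I_3} = 1/6$$

$$a_{D_3 M_4} = 1/1, \qquad a_{D_3 D_4} = 0$$

Using pseudocounts:

$$a_{M_3 M_4} = \frac{4+1}{6+3} = \frac{5}{9}, \qquad a_{M_3 D_4} = \frac{1+1}{6+3} = \frac{2}{9}, \qquad a_{M_3 I_3} = \frac{1+1}{6+3} = \frac{2}{9}$$

$$a_{D_3 M_4} = \frac{1+1}{1+2} = \frac{2}{3}, \qquad a_{D_3 D_4} = \frac{0+1}{1+2} = \frac{1}{3}$$

Always check normalization! Here we removed the D → I transition, so $D_3$ has only two outgoing transitions and its denominator is 1 + 2.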
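The normalization check can be automated. The sketch below uses the transition counts of this example, with one pseudocount per allowed transition; the dictionary layout and the name `transition_probabilities` are assumptions.

```python
from fractions import Fraction

# Transition counts from the example. D3 -> I3 is removed from the model,
# so D3 has only two allowed successors.
counts = {
    "M3": {"M4": 4, "D4": 1, "I3": 1},
    "D3": {"M4": 1, "D4": 0},
}

def transition_probabilities(counts, pseudocount=1):
    """Add a pseudocount to every allowed transition and normalize each row."""
    probs = {}
    for state, row in counts.items():
        total = sum(row.values()) + pseudocount * len(row)
        probs[state] = {nxt: Fraction(c + pseudocount, total) for nxt, c in row.items()}
    return probs

probs = transition_probabilities(counts)
for state, row in probs.items():
    # the outgoing probabilities of every state must sum to 1
    assert sum(row.values()) == 1, f"row {state} is not normalized"
print(probs["M3"]["M4"], probs["D3"]["M4"])  # 5/9 2/3
```

Because the denominator is computed from the transitions actually present in each row, removing a transition from the model automatically changes the normalization.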

Searching with Profile-HMMs

Profile-HMMs can be used to detect a possible new member of a sequence family

We must compare the new sequence against the profile-HMM model

We can use Viterbi to obtain the most probable path $\pi^*$ across the model and then calculate its probability:

$$P(S \mid \pi^*)$$

We can use Forward to obtain the total probability of the sequence given the model:

$$P(S) = P(s_1 \ldots s_L) = \sum_{\pi} P(s_1 \ldots s_L, \pi_0 \ldots \pi_N)$$

In general we use log-likelihood ratios (log-odds) with respect to a background model.

Searching with Profile-HMMs

•  $V_j^M(i)$: best score (log-likelihood ratio) of the best path of states aligning the subsequence $s_1 \ldots s_i$ to the submodel up to state j, ending with the emission of $s_i$ by $M_j$.
•  $V_j^I(i)$: best score of the best path ending with $s_i$ being emitted by $I_j$.
•  $V_j^D(i)$: best score of the best path ending in $D_j$.

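Profile HMM Viterbi

$$V_j^M(i) = \log\frac{e_{M_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_{j-1}^M(i-1) + \log a_{M_{j-1} M_j} \\ V_{j-1}^I(i-1) + \log a_{I_{j-1} M_j} \\ V_{j-1}^D(i-1) + \log a_{D_{j-1} M_j} \end{cases}$$

$$V_j^I(i) = \log\frac{e_{I_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_j^M(i-1) + \log a_{M_j I_j} \\ V_j^I(i-1) + \log a_{I_j I_j} \\ V_j^D(i-1) + \log a_{D_j I_j} \end{cases}$$

$$V_j^D(i) = \max \begin{cases} V_{j-1}^M(i) + \log a_{M_{j-1} D_j} \\ V_{j-1}^I(i) + \log a_{I_{j-1} D_j} \\ V_{j-1}^D(i) + \log a_{D_{j-1} D_j} \end{cases}$$

Deletion states are silent, so the sequence index i does not advance in the recurrence for $V_j^D(i)$.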

In these recurrences:

•  The emission term $\log\left(e_{I_j}(s_i)/q_{s_i}\right)$ in the recurrence for $V_j^I(i)$ does not contribute in general, since $e_{I_j}(s_i) = q_{s_i}$.
•  The transitions between deletion and insertion states ($a_{D_j I_j}$ and $a_{I_{j-1} D_j}$) are usually not present (negligible when scoring an alignment to the model).

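Initialisation: the start state is $M_0$, such that

$$V_0^M(0) = 0$$

We allow transitions to $I_0$ and $D_1$.

Termination: the end state is $M_{L+1}$. We allow the alignment to end in a deletion or insertion state:

$$\text{Score}(S \mid \pi^*) = \max \begin{cases} V_L^M(n) + \log a_{M_L,\text{end}} \\ V_L^I(n) + \log a_{I_L,\text{end}} \\ V_L^D(n) + \log a_{D_L,\text{end}} \end{cases}$$

Here L is the number of match states in the model and n is the length of the sequence S.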
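The recurrences, initialisation and termination can be put together in a compact sketch. This is an illustration under stated assumptions, not a reference implementation: the state naming ("M0", "I1", "D2", "end"), the dictionary-based parameters, the toy two-state DNA model and all its probabilities are hypothetical. Insert emissions are taken equal to the background (log-odds 0), delete states leave the sequence index unchanged, and absent transitions (such as D → I and I → D) count as probability 0.

```python
import math

NEG_INF = -math.inf

def safe_log(x):
    return math.log(x) if x > 0 else NEG_INF

def viterbi_score(seq, match_emissions, transitions, background):
    """Log-odds score of the best path; match_emissions[j-1] holds e_{M_j}."""
    L, n = len(match_emissions), len(seq)

    def la(src, dst):
        # log transition probability; absent transitions have probability 0
        return safe_log(transitions.get((src, dst), 0.0))

    # V[kind][j][i] for model position j = 0..L and sequence position i = 0..n
    V = {kind: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for kind in "MID"}
    V["M"][0][0] = 0.0  # initialisation: start state M_0

    for i in range(n + 1):
        for j in range(L + 1):
            if i >= 1 and j >= 1:
                s = seq[i - 1]
                emit = safe_log(match_emissions[j - 1].get(s, 0.0) / background[s])
                V["M"][j][i] = emit + max(
                    V[k][j - 1][i - 1] + la(f"{k}{j - 1}", f"M{j}") for k in "MID")
            if i >= 1:
                # insert emissions equal the background: log-odds term is 0
                V["I"][j][i] = max(
                    V[k][j][i - 1] + la(f"{k}{j}", f"I{j}") for k in "MID")
            if j >= 1:
                # delete states are silent: i does not advance
                V["D"][j][i] = max(
                    V[k][j - 1][i] + la(f"{k}{j - 1}", f"D{j}") for k in "MID")
    # termination: best of the three states at the last model position
    return max(V[k][L][n] + la(f"{k}{L}", "end") for k in "MID")

# Hypothetical toy model with two match states over DNA
background = {b: 0.25 for b in "ACGT"}
match_emissions = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # e_{M_1}
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},  # e_{M_2}
]
transitions = {
    ("M0", "M1"): 0.8, ("M0", "I0"): 0.1, ("M0", "D1"): 0.1,
    ("I0", "M1"): 0.5, ("I0", "I0"): 0.5,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "M2"): 0.5, ("I1", "I1"): 0.5,
    ("D1", "M2"): 0.9, ("D1", "D2"): 0.1,
    ("M2", "end"): 0.9, ("M2", "I2"): 0.1,
    ("I2", "end"): 0.5, ("I2", "I2"): 0.5,
    ("D2", "end"): 1.0,
}
print(viterbi_score("AG", match_emissions, transitions, background))
print(viterbi_score("A", match_emissions, transitions, background))
```

In this toy model the sequence "AG" follows the match states directly and gets a positive score, while "A" must use a deletion state and scores lower. A traceback over the same tables would recover the path itself; it is omitted to keep the sketch short.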

Making a collection of Profile-HMMs for protein families

•  Use BLAST to separate a protein database into families of related proteins.
•  Construct a multiple alignment for each protein family.
•  Construct a profile-HMM and optimize the parameters of the model (transition and emission probabilities).
•  Align the target sequence against each HMM to find the best fit between the target sequence and an HMM.

PFAM

•  Pfam describes protein domains (http://pfam.sanger.ac.uk/).
•  Each protein domain family in Pfam has:
   - Seed alignment: manually verified multiple alignment of a representative set of sequences.
   - HMM built from the seed alignment for further database searches.
   - Full alignment generated automatically from the HMM.
•  The distinction between seed and full alignments facilitates Pfam updates:
   - Seed alignments are stable resources.
   - HMM profiles and full alignments can be updated with newly found amino acid sequences.
•  Pfam HMMs span entire domains that include both well-conserved motifs and less-conserved regions with insertions and deletions.
•  Modelling complete domains facilitates better sequence annotation and leads to more sensitive detection.

References

•  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh and Graeme Mitchison. Cambridge University Press, 1999.
•  Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky and Svetlana Ekisheva. Cambridge University Press, 2006.
•  Bioinformatics and Molecular Evolution. Paul G. Higgs and Teresa Attwood. Blackwell Publishing, 2005.
•  An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). Neil C. Jones and Pavel A. Pevzner. MIT Press, 2004.
