BINF6201/8201 Hidden Markov Models for Sequence Analysis 4 11-29-2011


Page 1

BINF6201/8201

Hidden Markov Models for Sequence Analysis 4

11-29-2011

Page 2

Choice of model topology

The structure (topology) and the parameters together determine an HMM.

The parameters of an HMM can be determined by the Baum-Welch algorithm and other optimization methods.

The design of the topology of an HMM, by contrast, is based on an understanding of the problem and of the data available to solve it.

Page 3

Profile HMM for sequence families

Profile HMMs are a special type of HMM used to model multiple alignments of protein families.

Once a profile HMM is constructed for a protein family, it can be used to evaluate whether a new sequence belongs to the family (the scoring problem).

The most probable path of a sequence generated by the model can be used to align the sequence to the members of the family (the decoding problem).

(Figure: a profile HMM alignment, with match states and indel states labeled.)

Page 4

Profile HMM for sequence families

Given a block of an ungapped multiple alignment of a protein family, we can use the following HMM to model the block.

(Figure: Begin → M_1 → … → M_j → … → M_n → End, with emission probabilities e_1(b_i), …, e_j(b_i), …, e_L(b_i).)

Here, M_j corresponds to the ungapped column j in the alignment and is called a match state. M_j emits amino acid b_i with probability e_j(b_i).

The transition probability between two adjacent match states M_{j-1} and M_j is 1, i.e., a_{j-1, j} = 1, because a match state cannot transit to any other state or back to itself.

Page 5

Profile HMM for sequence families

Since we know the path of a sequence generated by the model, the probability that a sequence x is generated by the model is

P(x|M) = P(x, \pi | M) = \prod_{j=1}^{L} a_{M_{j-1} M_j} e_j(x_j) = \prod_{j=1}^{L} e_j(x_j).

To make this probability more meaningful, we can compare it with a background probability. The probability that the sequence x is generated randomly (by a random model R) is

P(x|R) = \prod_{i=1}^{L} q_{x_i}.

The log-odds ratio is

S(x) = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}},

which essentially is a position-specific scoring/weight matrix (PSSM).

Therefore, this HMM is equivalent to a PSSM, and we score the sequence x with the PSSM of the block, which is more sensitive than using a general-purpose scoring matrix such as PAM or BLOSUM in a pairwise alignment.
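The log-odds scoring S(x) can be sketched in a few lines of Python. The three-column block, its emission probabilities, and the background frequencies below are made-up illustrative values, not taken from the lecture:

```python
import math

# Hypothetical emission probabilities e_j(b) for a 3-column block (match states
# M1..M3) and hypothetical background frequencies q_b -- illustrative only.
emissions = [
    {"V": 0.6, "I": 0.2, "F": 0.2},
    {"L": 0.7, "I": 0.3},
    {"K": 0.5, "R": 0.5},
]
background = {"V": 0.07, "I": 0.06, "F": 0.04, "L": 0.10, "K": 0.06, "R": 0.05}

def pssm_score(x):
    """Log-odds score S(x) = sum_j log( e_j(x_j) / q_{x_j} )."""
    return sum(math.log(emissions[j][b] / background[b]) for j, b in enumerate(x))
```

A sequence that matches the column preferences (e.g., "VLK" here) scores well above 0, while the background model would score 0 on average.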

Page 6

Profile HMM for sequence families

To model insertions after the match state M_j, we introduce into the model an insertion state I_j.

(Figure: Begin → M_1 → … → M_j → M_{j+1} → … → M_n → End, with an insertion state I_j between M_j and M_{j+1}.)

In this case, M_j can transit to the next match state M_{j+1} or to I_j, and I_j can move on to M_{j+1} or remain in I_j.

I_j emits an amino acid b with probability e_{I_j}(b), which is usually set to the background frequency of the amino acid b, q_b.

The log-odds ratio for generating an insertion sequence of length k is

\log a_{M_j I_j} + \sum_{x} \log \frac{e_{I_j}(x)}{q_x} + (k-1) \log a_{I_j I_j} + \log a_{I_j M_{j+1}} = \log a_{M_j I_j} + (k-1) \log a_{I_j I_j} + \log a_{I_j M_{j+1}},

since e_{I_j}(x) = q_x makes the emission terms vanish.

This is equivalent to an affine gap penalty function, but it is position dependent and therefore more accurate.
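A minimal sketch of this position-dependent affine score, with assumed (illustrative) transition probabilities:

```python
import math

# Assumed transition probabilities at position j (illustrative values only):
a_MI, a_II, a_IM = 0.05, 0.4, 0.6   # M_j -> I_j,  I_j -> I_j,  I_j -> M_{j+1}

def insertion_score(k):
    """Log-odds of an insertion of length k >= 1; the emission terms cancel
    because e_{I_j}(b) = q_b."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

# The score is affine in k: a constant gap-open cost plus a per-residue
# gap-extension cost, both determined by the local transition probabilities.
gap_open = math.log(a_MI) + math.log(a_IM)
gap_extend = math.log(a_II)
```

Because a_MI, a_II and a_IM can differ at each position j, the effective open and extend costs differ along the model.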

Page 7

Profile HMM for sequence families

To model deletions at some match states, we use a deletion state D_j at each position j.

The deletion state D_j does not emit any signal, so it is called a silent state.

(Figure: Begin → M_1 … M_{j-1} → D_j → D_{j+1} → … → D_{j+k-1} → M_{j+k} → … → End, a path with k deletions through the silent states D_j, …, D_{j+k-1}.)

The penalty score for a deletion of length k starting at position j is

\log a_{M_{j-1} D_j} + \log a_{D_j D_{j+1}} + \cdots + \log a_{D_{j+k-2} D_{j+k-1}} + \log a_{D_{j+k-1} M_{j+k}}.

Because the D→D transition probabilities differ from position to position, the penalty for deletions is not equivalent to an affine penalty function, but again, it is position dependent.
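The deletion penalty can be sketched similarly; the transition probabilities below are hypothetical, deliberately chosen to vary by position:

```python
import math

# Hypothetical position-dependent transition probabilities along a deletion path:
a_MD = 0.03               # M_{j-1} -> D_j
a_DD = [0.5, 0.45, 0.4]   # D_j -> D_{j+1}, D_{j+1} -> D_{j+2}, ... (vary by position)
a_DM = 0.7                # D_{j+k-1} -> M_{j+k}

def deletion_score(k):
    """Sum of log transition probabilities for a deletion of length k (k <= 4 here)."""
    return math.log(a_MD) + sum(math.log(p) for p in a_DD[:k - 1]) + math.log(a_DM)

# Successive extension costs differ (log 0.5, then log 0.45, ...), so the total
# is not open-cost-plus-constant-extension, i.e., not a single affine function.
```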

Page 8

Profile HMM for sequence families

The complete profile HMM has the following structure if transitions between insertion and deletion states are not considered. Leaving them out has little effect on scoring sequences, but may cause problems when training the model.

The following profile model, in contrast, does consider transitions between insertion and deletion states. In this model, each M_j, D_j and I_j has three outgoing transitions, except at the last position, where each state has only two.

Page 9

Derive profile HMMs from multiple alignments

Given a multiple alignment of a protein family, we first determine how many match states should be used to model the family.

A general rule is to treat columns in which fewer than 50% of the sequences have a deletion as match states.

(Figure: a segment of a multiple alignment of hemoglobin proteins.)

Using this rule, we model this segment of the alignment by a model having eight match states.

Page 10

Derive profile HMMs from multiple alignments

Based on the general design of profile HMMs, we have the following model for the segment of the alignment (an HMM of length 8).

From the alignment, we know the path of each sequence; therefore, the transition and emission probabilities can be estimated by the general formulas

a_{kl} = \frac{A_{kl}}{N_k}, \qquad e_k(b) = \frac{E_k(b)}{N_k},

where A_{kl} is the observed count of transitions from state k to state l, E_k(b) is the observed count of emissions of b from state k, and N_k is the corresponding total count at state k.
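A minimal sketch of this count-based estimation, using toy counts rather than the hemoglobin data from the slide:

```python
from collections import Counter

# Toy counts for one match state of a profile HMM (illustrative data only).
transition_counts = {("M1", "M2"): 7, ("M1", "D2"): 2, ("M1", "I1"): 1}
column = "VVIVF"   # residues observed in match column 1

# a_kl = A_kl / N_k: normalize transition counts by their total.
n_trans = sum(transition_counts.values())
a = {kl: c / n_trans for kl, c in transition_counts.items()}

# e_k(b) = E_k(b) / N_k: normalize emission counts by their total.
emission_counts = Counter(column)
n_emit = sum(emission_counts.values())
e = {b: c / n_emit for b, c in emission_counts.items()}
```

Note that any residue absent from the column gets probability 0 here, which is the motivation for the pseudocounts on the next slide.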

Page 11

Pseudocounts

When counting events, to avoid zero probabilities, we usually add pseudocounts to the total counts.

The simplest way to add pseudocounts is to add one to each count; this is called Laplace's rule. For example, using this rule on the alignment above, we have

e_{M_1}(V) = 6/27; \quad e_{M_1}(I) = e_{M_1}(F) = 2/27; \quad e_{M_1}(a) = 1/27 \text{ for } a \neq V, I, F,

a_{M_1 M_2} = 7/10; \quad a_{M_1 D_2} = 2/10; \quad a_{M_1 I_1} = 1/10.

A slightly more sophisticated method is to add a quantity proportional to the background frequency. For example, if we add A sequences to the alignment, we expect that A q_b of them will have a b at the position; then the emission probability of b at M_k is

e_{M_k}(b) = \frac{E_{M_k}(b) + A q_b}{N_k + A},

where q_b is the background frequency of amino acid b in the alignment.
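Both pseudocount schemes can be sketched as follows; the column counts are chosen to reproduce the 6/27 and 2/27 values above (5 V, 1 I, 1 F in 7 sequences):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def laplace(counts):
    """Laplace's rule: add one pseudocount for each of the 20 amino acids."""
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return {a: (counts.get(a, 0) + 1) / total for a in AMINO_ACIDS}

def background_pseudocounts(counts, q, A):
    """e(b) = (E(b) + A*q_b) / (N + A): pseudocounts proportional to background q_b."""
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * q[a]) / total for a in AMINO_ACIDS}

# Column consistent with the slide's numbers: 5 V, 1 I, 1 F in 7 sequences.
counts = {"V": 5, "I": 1, "F": 1}
e = laplace(counts)
```

Laplace's rule is the special case of the background scheme with a uniform background q_b = 1/20 and A = 20.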

Page 12

Dirichlet prior distribution

This means that we add our prior knowledge to the counts. It is equivalent to computing the posterior probability of the theoretical value of e_M(b), \theta, after we see some counts of E_M(b): n out of a total of K counts, i.e.,

p(\theta | n) = \frac{p(n | \theta) p(\theta)}{p(n)}, \qquad e_{M_k}(b) = \frac{n + A q_b}{K + A}.

To see this, we need to do some mathematical derivation. If we consider the frequencies of the 20 amino acids in a column of an alignment, these 20 frequencies sum to 1. These values change from column to column, so they are random variables, and they follow a Dirichlet distribution,

p(\theta_1, \ldots, \theta_{20} : \alpha_1, \ldots, \alpha_{20}) = \frac{1}{Z} \prod_{i=1}^{20} \theta_i^{\alpha_i - 1},

where Z is a normalization factor, and \alpha_1, \ldots, \alpha_{20} are the parameters that determine the shape of the distribution.

Interestingly, it can be shown that the mean of \theta_i is

\bar{\theta}_i = \alpha_i / \sum_i \alpha_i.

Page 13

Dirichlet prior distribution

Therefore, if we do not know the frequencies of the 20 amino acids in a column, we can use such a Dirichlet distribution to model the prior of these frequencies. Let \alpha_i = A q_i; then we have the Dirichlet prior distribution

p(\theta_1, \ldots, \theta_{20} : \alpha_1, \ldots, \alpha_{20}) = \frac{1}{Z} \prod_{i=1}^{20} \theta_i^{A q_i - 1},

and the average frequency of amino acid i is q_i:

\bar{\theta}_i = \frac{\alpha_i}{\sum_i \alpha_i} = \frac{A q_i}{A \sum_i q_i} = q_i.

Although the parameter A does not affect the average frequency q_i, it affects the shape of the distribution.

To see the effect of A on the shape of the Dirichlet, let us consider only one type of amino acid (e.g., the acidic amino acids) with frequency \theta and mean frequency q. The combined frequency of all the other amino acids is then 1 - \theta, with mean 1 - q.

Page 14

Dirichlet prior distribution

The Dirichlet distribution of this frequency \theta is

p(\theta) = \frac{1}{Z} \theta^{A q - 1} (1 - \theta)^{A(1-q) - 1}.

When the average frequency of this type of amino acid is q = 0.05, changing A gives the following shapes of the Dirichlet distribution. Although the means of \theta are the same, the larger the value of A, the narrower the distribution.

(Figure: Dirichlet densities for q = 0.05 and several values of A.)

In general, when we have high confidence in q, we use a large A value; otherwise, we should use a small A value.
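The effect of A can be checked numerically: in this single-class case the Dirichlet reduces to a Beta(Aq, A(1-q)) density, whose mean is q for every A while its variance q(1-q)/(A+1) shrinks as A grows:

```python
# Single-class Dirichlet = Beta(Aq, A(1-q)). Mean and variance from the
# standard Beta formulas; q = 0.05 matches the slide's example.
q = 0.05

def beta_mean_var(A):
    a, b = A * q, A * (1 - q)
    mean = a / (a + b)                          # = q, independent of A
    var = a * b / ((a + b) ** 2 * (a + b + 1))  # = q*(1-q)/(A+1)
    return mean, var
```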

Page 15

Dirichlet prior distribution

Now let us consider the posterior distribution after observing data, using a Dirichlet prior distribution. Let K be the total number of observed amino acids in a column, of which n are of the type that we are considering.

The likelihood for this to happen can be computed by a binomial distribution,

L(n | \theta) = C_K^n \theta^n (1 - \theta)^{K - n}.

The posterior distribution is

p(\theta | n) = \frac{p(n | \theta) p(\theta)}{p(n)} \propto C_K^n \theta^n (1 - \theta)^{K-n} \cdot \frac{1}{Z} \theta^{A q - 1} (1 - \theta)^{A(1-q) - 1}.

Through normalization, we have

p(\theta | n) = \frac{1}{Z'} \theta^{n + A q - 1} (1 - \theta)^{K - n + A(1-q) - 1}.

Page 16

Dirichlet prior distribution

Therefore, the posterior probability also follows a Dirichlet distribution, but with different parameters.

The mean of the posterior distribution of \theta is

\bar{\theta} = \frac{n + A q}{(n + A q) + (K - n + A(1 - q))} = \frac{n + A q}{K + A}.

This justifies using pseudocounts A q_b to estimate the posterior frequency of amino acid b.

(Figure: the posterior distribution p(\theta | n) when q = 0.05 but the real frequency is 0.5.)

When K is large, adding the prior A q has little effect on the estimate, but when K is small, the effect can be big.
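A quick numerical check of the posterior mean (the sample sizes and the value A = 20 are illustrative, not from the slides):

```python
# Posterior mean (n + A*q) / (K + A): with few observations the prior dominates,
# with many observations the data dominate. Prior mean q = 0.05 as on the slide;
# suppose the true frequency is 0.5.
q, A = 0.05, 20

def posterior_mean(n, K):
    return (n + A * q) / (K + A)

small_K = posterior_mean(5, 10)       # K small: pulled strongly toward the prior
large_K = posterior_mean(500, 1000)   # K large: close to the observed 0.5
```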

Page 17

Application of profile HMMs

Once a profile HMM is constructed for a protein family, it can be used to score a new sequence. The sequence can also be aligned to the family using the path decoded by the Viterbi algorithm or the forward and backward algorithms.

Two popular tools for profile HMM applications are freely available online:

1. HMMER: http://hmmer.janelia.org/
   • Developed by Sean Eddy and colleagues in the early 1990s.
   • It contains tools for building an HMM from a multiple alignment, and tools for searching an HMM database.
   • HMMER is also associated with the Pfam protein family database at the same site.

2. SAM: http://compbio.soe.ucsc.edu/sam.html
   • The first profile HMM tool, developed by David Haussler and Anders Krogh in the early 1990s.