Biological Sequences and Hidden Markov Models CBPS7711 Sept 9, 2010

Center for Genes, Environment, and Health


Sonia Leach, PhD
Assistant Professor
Center for Genes, Environment, and Health
National Jewish Health

sonia.leach@gmail.com

Slides created from David Pollock's slides from last year's CBPS7711 and the current reading list from the CBPS7711 website

Andrey Markov 1856-1922


Introduction
• Despite complex 3-D structure, biological molecules have a primary linear sequence (DNA, RNA, protein) or a linear sequence of features (CpG islands, models of exons, introns, regulatory regions, genes)
• Hidden Markov Models (HMMs) are probabilistic models for processes that transition through a discrete set of states, each emitting a symbol (a probabilistic finite state machine)
• HMMs exhibit the 'Markov property': the conditional probability distribution of future states of the process depends only upon the present state (memoryless)
• The linear sequence of molecules/features is modelled as a path through the states of the HMM, which emit the sequence of molecules/features
• The actual state is hidden and observed only through output symbols

Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
– Initial state distribution $\pi_i = \Pr(X_1 = i)$
– Transition probability $a_{ij} = \Pr(X_t = j \mid X_{t-1} = i)$
– Emission probability $b_{ik} = \Pr(O_t = k \mid X_t = i)$

Example: N = 3, M = 2, π = (0.25, 0.55, 0.2), A = [3×3 transition matrix], B = [3×2 emission matrix] (values shown on the original slide)


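Probabilistic Graphical Models
[Figure: four related models, each drawn over time steps t−1 and t]
• Markov Process (MP): states X_{t−1}, X_t
• Adding observability: Hidden Markov Model (HMM), states X_{t−1}, X_t
• Adding utility: Markov Decision Process (MDP), actions A_{t−1}, A_t; states X_{t−1}, X_t; utilities U_{t−1}, U_t
• Adding observability and utility: Partially Observable Markov Decision Process (POMDP), actions A_{t−1}, A_t; states X_{t−1}, X_t; observations O_{t−1}, O_t; utilities U_{t−1}, U_t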

Three basic problems of HMMs
1. Given the observation sequence $O = O_1, O_2, \ldots, O_n$, how do we compute $\Pr(O \mid \lambda)$?
2. Given the observation sequence, how do we choose the corresponding state sequence $X = X_1, X_2, \ldots, X_n$ which is optimal?
3. How do we adjust the model parameters to maximize $\Pr(O \mid \lambda)$?

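• The probability of O is a sum over all state sequences:

$$\Pr(O \mid \lambda) = \sum_{\text{all } X} \Pr(O \mid X, \lambda)\,\Pr(X \mid \lambda) = \sum_{\text{all } X} \pi_{x_1} b_{x_1 o_1} a_{x_1 x_2} b_{x_2 o_2} \cdots a_{x_{n-1} x_n} b_{x_n o_n}$$

• An efficient dynamic programming algorithm does this: the Forward algorithm (Baum and Welch)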
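The sum over all state sequences can be checked directly on a small model. The sketch below is only an illustration and is not from the original slides: it enumerates every path for the two-state CpG model used later in the deck, assuming π = (0.5, 0.5) and the observation sequence GCGAA (one sequence consistent with the trellis numbers on the slides). The names `PI`, `A`, `B` and `brute_force_likelihood` are hypothetical.

```python
from itertools import product

STATES = ("C", "N")  # C = CpG island, N = non-CpG
PI = {"C": 0.5, "N": 0.5}  # assumed uniform start, as in the slides' 0.5 factor
A = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.1, "N": 0.9}}
B = {"C": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "N": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def brute_force_likelihood(obs):
    """Sum Pr(O | X) Pr(X) over all len(STATES)**len(obs) state paths."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        # pi_{x1} * b_{x1,o1}
        p = PI[path[0]] * B[path[0]][obs[0]]
        # a_{x(t-1),x(t)} * b_{x(t),o(t)} for each later position
        for prev, cur, sym in zip(path, path[1:], obs[1:]):
            p *= A[prev][cur] * B[cur][sym]
        total += p
    return total

print(brute_force_likelihood("GCGAA"))
```

The loop visits 2^5 = 32 paths here; the count grows as N^n, which is why the dynamic programming formulation matters for realistic sequence lengths.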


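A Simple HMM
CpG islands, where in one state there is a much higher probability to be C or G

Emission probabilities:
State      G    C    A    T
CpG        .3   .3   .2   .2
Non-CpG    .1   .1   .4   .4

Transition probabilities: CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→Non-CpG 0.9, Non-CpG→CpG 0.1

From David Pollock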
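To make the generative reading of this model concrete, here is a minimal sketch that samples a hidden state path and an emitted sequence from the CpG HMM. It is an illustration rather than anything from the slides; the uniform initial distribution, the seed, and the names `draw` and `sample` are assumptions.

```python
import random

PI = {"CpG": 0.5, "Non-CpG": 0.5}  # assumed initial distribution
A = {"CpG": {"CpG": 0.8, "Non-CpG": 0.2},
     "Non-CpG": {"CpG": 0.1, "Non-CpG": 0.9}}
B = {"CpG": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "Non-CpG": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def sample(length, rng):
    """Return (hidden state path, emitted sequence) of the given length."""
    states, symbols = [], []
    state = draw(PI, rng)
    for _ in range(length):
        states.append(state)
        symbols.append(draw(B[state], rng))  # emit from the current state
        state = draw(A[state], rng)          # then move to the next state
    return states, "".join(symbols)

rng = random.Random(0)
states, seq = sample(20, rng)
print(seq)
print("".join("C" if s == "CpG" else "n" for s in states))
```

Only the first printed line would be visible in real data; the second line is the hidden path that the algorithms on the following slides try to reason about.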

The Forward Algorithm
Probability of a Sequence is the Sum of All Paths that Can Produce It

Adapted from David Pollock's slides

Pr(G | λ) = π_C b_CG + π_N b_NG = .5*.3 + .5*.1

For convenience, let's drop the 0.5 for now and add it in later

For O = GC there are 4 possible state sequences: CC, NC, CN, NN

Second symbol, CpG: (.3*.8 + .1*.1)*.3 = .075
Second symbol, Non-CpG: (.3*.2 + .1*.9)*.1 = .015

For O = GCG there are 8 possible state sequences: CCC, CCN, NCC, NCN, NNC, NNN, CNC, CNN

Third symbol, CpG: (.075*.8 + .015*.1)*.3 = .0185
Third symbol, Non-CpG: (.075*.2 + .015*.9)*.1 = .0029


Extending the trellis by two more symbols, each emitted with probability .2 from CpG and .4 from Non-CpG (that is, A or T):

Fourth symbol, CpG: (.0185*.8 + .0029*.1)*.2 = .003
Fourth symbol, Non-CpG: (.0185*.2 + .0029*.9)*.4 = .0025
Fifth symbol, CpG: (.003*.8 + .0025*.1)*.2 = .0005
Fifth symbol, Non-CpG: (.003*.2 + .0025*.9)*.4 = .0011


Problem 1: Pr(O| λ)=0.5*.0005 + 0.5*.0011= 8e-4
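The trellis above can be reproduced with a short forward-algorithm sketch. This is a minimal illustration, not the author's code: it assumes π = (0.5, 0.5) and the observation sequence GCGAA, and the function name `forward` is hypothetical. Unlike the slides, it keeps the 0.5 initial factor inside the recursion, so every column is half of the corresponding slide value.

```python
STATES = ("C", "N")  # C = CpG island, N = non-CpG
PI = {"C": 0.5, "N": 0.5}  # assumed uniform start
A = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.1, "N": 0.9}}
B = {"C": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "N": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def forward(obs, states, pi, a, b):
    """Return columns alpha_t(i) = Pr(O_1..t, X_t = i | lambda)."""
    alpha = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for sym in obs[1:]:
        prev = alpha[-1]
        # Sum over predecessor states, then multiply by the emission.
        alpha.append({j: sum(prev[i] * a[i][j] for i in states) * b[j][sym]
                      for j in states})
    return alpha

alpha = forward("GCGAA", STATES, PI, A, B)
likelihood = sum(alpha[-1].values())  # Pr(O | lambda)
for t, col in enumerate(alpha, start=1):
    print(t, {i: round(v, 6) for i, v in col.items()})
print("Pr(O | lambda) =", likelihood)
```

The final value is about 8.4e-4, in line with the 8e-4 obtained on the slide from rounded intermediate values. The work is proportional to n·N² rather than N^n.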


Problem 2: What is optimal state sequence?


The Viterbi Algorithm
Most Likely Path (use max instead of sum)

Adapted from David Pollock's slides (note error in formulas on his)

Second symbol, CpG: max(.3*.8, .1*.1)*.3 = .072
Second symbol, Non-CpG: max(.3*.2, .1*.9)*.1 = .009
Third symbol, CpG: max(.072*.8, .009*.1)*.3 = .0173
Third symbol, Non-CpG: max(.072*.2, .009*.9)*.1 = .0014
Fourth symbol, CpG: max(.0173*.8, .0014*.1)*.2 = .0028
Fourth symbol, Non-CpG: max(.0173*.2, .0014*.9)*.4 = .0014
Fifth symbol, CpG: max(.0028*.8, .0014*.1)*.2 = .00044
Fifth symbol, Non-CpG: max(.0028*.2, .0014*.9)*.4 = .0005

The Viterbi Algorithm
Most Likely Path: Backtracking

Start from the larger final value (.0005, Non-CpG) and, at each step, follow the term that won the max:

Fifth symbol, Non-CpG: .0014*.9 wins, so the fourth state is Non-CpG
Fourth symbol, Non-CpG: .0173*.2 wins, so the third state is CpG
Third symbol, CpG: .072*.8 wins, so the second state is CpG
Second symbol, CpG: .3*.8 wins, so the first state is CpG

Most likely path: CpG, CpG, CpG, Non-CpG, Non-CpG
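The max-and-backtrack procedure can be sketched as follows. This is a minimal illustration under the same assumptions as before (π = (0.5, 0.5), observation sequence GCGAA); the function name `viterbi` and the single-letter state labels are hypothetical. Because the 0.5 initial factor is kept, the path probability is half of the slide's final value.

```python
STATES = ("C", "N")  # C = CpG island, N = non-CpG
PI = {"C": 0.5, "N": 0.5}  # assumed uniform start
A = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.1, "N": 0.9}}
B = {"C": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "N": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def viterbi(obs, states, pi, a, b):
    """Return (most likely state path, joint probability of that path)."""
    delta = [{i: pi[i] * b[i][obs[0]] for i in states}]
    back = []  # back[t][j] = best predecessor of state j at position t+1
    for sym in obs[1:]:
        prev, col, ptr = delta[-1], {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * a[i][j])
            ptr[j] = best
            col[j] = prev[best] * a[best][j] * b[j][sym]
        delta.append(col)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), delta[-1][last]

path, prob = viterbi("GCGAA", STATES, PI, A, B)
print(path, prob)
```

For long sequences the products underflow, so practical implementations usually add log probabilities instead of multiplying.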

Forward-backward algorithm


Problem 3: How to learn model?
Forward algorithm calculated $\Pr(O_{1..t}, X_t = i \mid \lambda)$

Parameter estimation by the Baum-Welch Forward-Backward Algorithm

Forward variable $\alpha_t(i) = \Pr(O_{1..t}, X_t = i \mid \lambda)$
Backward variable $\beta_t(i) = \Pr(O_{t+1..n} \mid X_t = i, \lambda)$

Rabiner 1989
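The forward and backward variables combine into per-position state posteriors, γ_t(i) = α_t(i) β_t(i) / Pr(O | λ), which are the expected state occupancies that Baum-Welch re-estimation builds on. The sketch below shows only that step, not the full parameter update; it assumes the CpG model with π = (0.5, 0.5) and O = GCGAA, and all function names are hypothetical.

```python
STATES = ("C", "N")  # C = CpG island, N = non-CpG
PI = {"C": 0.5, "N": 0.5}  # assumed uniform start
A = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.1, "N": 0.9}}
B = {"C": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "N": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def forward(obs, states, pi, a, b):
    """alpha_t(i) = Pr(O_1..t, X_t = i | lambda)."""
    alpha = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for sym in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * a[i][j] for i in states) * b[j][sym]
                      for j in states})
    return alpha

def backward(obs, states, a, b):
    """beta_t(i) = Pr(O_t+1..n | X_t = i, lambda), with beta_n(i) = 1."""
    beta = [{i: 1.0 for i in states}]
    for sym in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(a[i][j] * b[j][sym] * nxt[j] for j in states)
                        for i in states})
    return beta

OBS = "GCGAA"
alpha = forward(OBS, STATES, PI, A, B)
beta = backward(OBS, STATES, A, B)
likelihood = sum(alpha[-1].values())
# gamma_t(i) = Pr(X_t = i | O, lambda)
gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in STATES}
         for t in range(len(OBS))]
for t, g in enumerate(gamma, start=1):
    print(t, {i: round(v, 3) for i, v in g.items()})
```

A useful sanity check is that the sum over i of α_t(i) β_t(i) equals Pr(O | λ) at every position t.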

Homology HMM
• Gene recognition, classify to identify distant homologs
• Common Ancestral Sequence
– Parameter set λ = (A, B, π), strict left-right model
– Specially defined set of states: start, stop, match, insert, delete
– For initial state distribution π, use 'start' state
– For transition matrix A, use global transition probabilities
– For emission matrix B:
  • Match: site-specific emission probabilities
  • Insert (relative to ancestor): global emission probabilities
  • Delete: emit nothing
• Multiple Sequence Alignments

Homology HMM
[Figure: left-to-right profile HMM with match, insert, and delete states, finishing in an end state]

Homology HMM Example
[Figure: profile HMM with match, insert, and delete states; three site-specific emission tables are shown, first five amino acids listed]

Table    A     C     D     E     F
1        .1    .05   .2    .08   .01
2        .04   .1    .01   .2    .02
3        .2    .01   .05   .1    .06

Eddy, 1998
[Figure panels:]
• Ungapped blocks
• Ungapped blocks where insertion states model intervening sequence between blocks
• Insert/delete states allowed anywhere
• Allow multiple domains, sequence fragments

Homology HMM
• Uses
– Find homologs to profile HMM in database
  • Score sequences for match to HMM
    – Not always Pr(O | λ), since some areas may highly diverge
    – Sometimes use 'highest scoring subsequence'
  • Goal is to find homologs in database
– Classify sequence using library of profile HMMs
  • Compare alternative models
– Alignment of additional sequences
– Structural alignment when the alphabet is secondary structure symbols, so can do fold recognition, etc.

Why Hidden Markov Models for MSA?
• Multiple sequence alignment as consensus
– May have substitutions, not all AA are equal
– Could use regular expressions but how to handle indels?
– What about variable-length members of family?

FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMV 112

FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQNRG-HPYGVPAPAPPAAYSRPAVL 112

FOS_RAT    IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE  IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK  VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

Why Hidden Markov Models?
• Rather than a consensus sequence, which describes the most common amino acid per position, HMMs allow more than one amino acid to appear at each position
• Rather than profiles as position-specific scoring matrices (PSSMs), which assign a probability to each amino acid in each position of the domain and slide a fixed-length profile along a longer sequence to calculate a score, HMMs model the probability of variable-length sequences
• Rather than regular expressions, which can capture variable-length sequences yet specify a limited subset of amino acids per position, HMMs quantify the difference among using different amino acids at each position

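Model Comparison
• Based on $P(D \mid \theta, M)$
– For ML, take $P_{\max}(D \mid \theta, M)$
  • Usually $\ln P_{\max}(D \mid \theta, M)$ to avoid numeric underflow
– For heuristics, "score" is $\log_2 P(D \mid \theta_{\text{fixed}}, M)$
– For Bayesian, calculate

$$P_{\max}(\theta, M \mid D) = \frac{P(D \mid \theta, M)\,P(\theta)\,P(M)}{\sum_{\theta', M'} P(D \mid \theta', M')\,P(\theta')\,P(M')}$$

– Uses 'prior' information on parameters, $P(\theta)$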
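The reason for working with logarithms shows up as soon as sequences get long: products of many probabilities underflow to zero in floating point. The sketch below is one common way to handle this, a forward algorithm in log space using the log-sum-exp trick; it is an illustration rather than the method of any particular tool. It assumes the two-state CpG model from the earlier slides with π = (0.5, 0.5) and strictly positive probabilities (a zero probability would need special handling), and the names `logsumexp` and `log_forward` are hypothetical.

```python
import math

STATES = ("C", "N")  # C = CpG island, N = non-CpG
PI = {"C": 0.5, "N": 0.5}  # assumed uniform start
A = {"C": {"C": 0.8, "N": 0.2}, "N": {"C": 0.1, "N": 0.9}}
B = {"C": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
     "N": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}

def logsumexp(values):
    """Stable log(sum(exp(v))) by factoring out the maximum."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_forward(obs, states, pi, a, b):
    """Return ln Pr(O | lambda); assumes every probability used is > 0."""
    cur = {i: math.log(pi[i]) + math.log(b[i][obs[0]]) for i in states}
    for sym in obs[1:]:
        cur = {j: logsumexp([cur[i] + math.log(a[i][j]) for i in states])
                  + math.log(b[j][sym])
               for j in states}
    return logsumexp(list(cur.values()))

short = log_forward("GCGAA", STATES, PI, A, B)
long_seq = log_forward("GCGAA" * 1000, STATES, PI, A, B)
print(math.exp(short))  # same likelihood as the plain forward algorithm
print(long_seq)         # finite, although the raw probability underflows
```

Log-likelihoods computed this way can be compared between models directly, for example as a difference of logs (a log-odds score).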

Parameters, θ
• Types of parameters
– Amino acid distributions for positions (match states)
– Global AA distributions for insert states
– Order of match states
– Transition probabilities
– Phylogenetic tree topology and branch lengths
– Hidden states (integrate or augment)
• Wander parameter space (search)
– Maximize, or move according to posterior probability (Bayes)

Expectation Maximization (EM)
• Classic algorithm to fit probabilistic model parameters with unobservable states
• Two Stages
– Maximize
  • If know hidden variables (states), maximize model parameters with respect to that knowledge
– Expectation
  • If know model parameters, find expected values of the hidden variables (states)
• Works well even with e.g., Bayesian to find near-equilibrium space

Homology HMM EM
• Start with heuristic MSA (e.g., ClustalW)
• Maximize
– Match states are residues aligned in most sequences
– Amino acid frequencies observed in columns
• Expectation
– Realign all the sequences given model
• Repeat until convergence
• Problems: Local, not global optimization
– Use procedures to check how it worked

Adapted from David Pollock's slides
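The "Maximize" step above can be sketched on a toy alignment: pick match columns as those where most sequences have a residue, then estimate each match state's emission probabilities from the column counts. Everything here is illustrative: the alignment, the five-letter alphabet, the 0.5 occupancy threshold, the add-one pseudocount, and the function names are assumptions, not values from the slides or from any HMM package.

```python
from collections import Counter

ALPHABET = "ACDEF"  # toy alphabet for illustration
ALIGNMENT = ["AC-DE",
             "AC-DE",
             "ACFDE",
             "A--DF"]  # hypothetical MSA, '-' marks a gap

def match_columns(alignment, min_fraction=0.5):
    """Columns where at least min_fraction of sequences have a residue."""
    n = len(alignment)
    return [c for c in range(len(alignment[0]))
            if sum(seq[c] != "-" for seq in alignment) / n >= min_fraction]

def emission_probs(alignment, column, alphabet, pseudocount=1.0):
    """Residue frequencies in one column, smoothed with a pseudocount."""
    counts = Counter(seq[column] for seq in alignment if seq[column] != "-")
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {aa: (counts[aa] + pseudocount) / total for aa in alphabet}

cols = match_columns(ALIGNMENT)
print("match columns:", cols)
for c in cols:
    probs = emission_probs(ALIGNMENT, c, ALPHABET)
    print(c, {aa: round(p, 2) for aa, p in probs.items()})
```

The pseudocount keeps unseen residues from getting probability zero, which would otherwise make any sequence containing them impossible under the model. The "Expectation" step would then realign the sequences to the resulting model, for instance with the Viterbi algorithm.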

Model Comparison
• Determining significance depends on comparing two models (family vs non-family)
– Usually null model, H0, and test model, H1
– Models are nested if H0 is a subset of H1
– If not nested
  • Akaike Information Criterion (AIC) [similar to empirical Bayes] or
  • Bayes Factor (BF) [but be careful]
• Generating a null distribution of statistic
– Z-factor, bootstrapping, parametric bootstrapping, posterior predictive
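For non-nested models, the Akaike Information Criterion is AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood; the model with the lower AIC is preferred. The snippet below only illustrates the arithmetic: the log-likelihoods and parameter counts are made-up numbers, and the function name `aic` is hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits of a null (non-family) and a test (family) model.
null_aic = aic(log_likelihood=-1250.0, n_params=19)
test_aic = aic(log_likelihood=-1190.0, n_params=60)

print("null:", null_aic)
print("test:", test_aic)
print("preferred:", "test" if test_aic < null_aic else "null")
```

In this made-up case the test model's better fit outweighs the penalty for its extra parameters.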

Z Test Method
• Database of known negative controls
– E.g., non-homologous (NH) sequences
– Assume NH scores $\sim N(\mu, \sigma)$
  • i.e., you are modeling known NH sequence scores as a normal distribution
– Set appropriate significance level for multiple comparisons (more below)
• Problems
– Is homology certain?
– Is it the appropriate null model?
  • Normal distribution often not a good approximation
– Parameter control hard: e.g., length distribution
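A minimal sketch of the Z test, assuming a list of scores from known non-homologous sequences is already available. The scores, the query score, the database size, and the Bonferroni-style threshold are hypothetical illustration values; the function names are not from any particular package.

```python
import math
from statistics import mean, stdev

def z_score(score, null_scores):
    """Standardize a score against the sample mean and sd of the null scores."""
    return (score - mean(null_scores)) / stdev(null_scores)

def upper_tail_p(z):
    """One-sided p-value Pr(Z >= z) under a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

nh_scores = [10.0, 12.0, 11.0, 9.0, 13.0]  # hypothetical NH sequence scores
query_score = 20.0                         # hypothetical candidate score

z = z_score(query_score, nh_scores)
p = upper_tail_p(z)

# Assumed multiple-comparison correction: divide alpha by the number
# of database sequences scored (Bonferroni).
n_comparisons = 10_000
threshold = 0.05 / n_comparisons
print(f"z = {z:.2f}, p = {p:.2e}, significant: {p < threshold}")
```

As the slide warns, the result is only as good as the normal approximation; real score distributions often have heavier tails.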

Bootstrapping and Parametric Models
• Random sequence sampled from the same set of emission probability distributions
– Same length is easy
– Bootstrapping is re-sampling columns
– Parametric uses estimated frequencies, may include variance, tree, etc.
  • More flexible, can have more complex null
  • Pseudocounts of global frequencies if data limit
• Insertions relatively hard to model
– What frequencies for insert states? Global?
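The two ways of generating null data mentioned above can be sketched side by side: non-parametric bootstrapping re-samples alignment columns with replacement, while the parametric version draws fresh sequences from estimated frequencies. The alignment, the frequencies, and the function names below are hypothetical, and the parametric sampler is the simplest possible one (independent sites, no tree or variance).

```python
import random

def bootstrap_columns(alignment, rng):
    """Re-sample alignment columns with replacement (same shape as input)."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picks) for seq in alignment]

def parametric_sequence(length, freqs, rng):
    """Draw an i.i.d. sequence of the given length from global frequencies."""
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))

ALIGNMENT = ["ACDE", "ACDF", "AC-E"]                      # hypothetical MSA
FREQS = {"A": 0.3, "C": 0.3, "D": 0.2, "E": 0.1, "F": 0.1}  # assumed frequencies

rng = random.Random(1)
replicate = bootstrap_columns(ALIGNMENT, rng)
null_seq = parametric_sequence(50, FREQS, rng)
print(replicate)
print(null_seq)
```

Scoring many such replicates or null sequences against the model gives an empirical null distribution for the statistic of interest.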

Homology HMM Resources
• UCSC (Haussler)
– SAM: align, secondary structure predictions, HMM parameters, etc.
• WUSTL/Janelia (Eddy)
– Pfam: database of pre-computed HMM alignments for various proteins
– HMMER: program for building HMMs

Why Hidden Markov Models?
• Multiple sequence alignment as consensus
– May have substitutions, not all AA are equal
– Could use regular expressions but how to handle indels?
– What about variable-length members of family?
– (but don't accept everything; typically introduce gap penalty)


Acknowledgements
