1 Markov Chains Algorithms in Computational Biology Spring 2006 Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler

1

Markov Chains

Algorithms in Computational Biology

Spring 2006

Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler

2

So far we assumed every letter in a sequence is sampled randomly from some distribution q()

This model could suffice for alignment scoring, but it is not the case in true genomes.

There are special subsequences in the genome in which dependencies between nucleotides exist

Example 1: TATA within the regulatory area, upstream a gene. Example 2: CG pairs

We model such dependencies by Markov chains and hidden Markov Models (HMMs)

Dependencies Along Biological Sequences

3

Markov Chains

A chain of random variables in which the next one depends only on the current Given X=x1…xn, then P(xi|x1…xi-1) =P(xi|xi-1)

The general case: kth–order Markov process Given X=x1…xn, then P(xi|x1…xi-1) =P(xi|xi-1…xi-k)

X1 X2 Xn-1 Xn

4

Markov Chains

An integer time stochastic process, consisting of a domain D of m>1 states {s1,…,sm} and

An m dimensional initial distribution vector (p(s1),.., p(s,)).

An m x m transition probabilities matrix M = (aij)

For example: D can be the letters {A, C, G, T} p(A) the probability of A to be the 1st letter in a sequence aAG the probability that G follows A in a sequence.

5

Markov Chains

For each integer n, a Markov Chain assigns probability to sequences (x1…xn) over D (i.e, xiD) as follows:

Similarly, (x1…xi…) is a sequence of probability distributions over D. There is a rich theory which studies the properties of these sequences.

1 2 1 1 1 12

(( , ,... )) ( ) ( | )n

n i i i ii

p x x x p X x p X x X x

11

2

( )i i

n

x xi

p x a

6

Matrix Representation

The transition probabilities Matrix M=(ast) M is a stochastic Matrix:

The initial distribution vector (U1…Um) defines the distribution of X1 P(X1= si)=Ui

Then after one move, the distribution changes to x2=x1M

0100

0.800.20

0.300.50.2

00.0500.95

A B

B

A

C

C

D

D1stt

a

7

Matrix Representation

Example:

if X1=(0, 1, 0, 0)Then X2=(0.2, 0.5, 0, 0.3)

And if X1=(0, 0, 0.5, 0.5) then X2=(0, 0.1, 0.5, 0.4).

The ith distribution is Xi=X1Mi-1

0100

0.800.20

0.300.50.2

00.0500.95

A B

B

A

C

C

D

D

8

Representing a Markov Model as a Digraph

0100

0.800.20

0.300.50.2

00.0500.95

A B

B

A

C

C

D

D

A B

C D

0.2

0.3

0.5

0.05

0.95

0.2

0.8

1

Each directed edge AB is associated with the transition probability from A to B.

9

Markov Chains – Weather Example

Weather forecast: raining today 40% rain tomorrow

60% no rain tomorrow No rain today 20% rain tomorrow

80% no rain tomorrow

Stochastic FSM:

rain no rain

0.60.4 0.8

0.2

10

Markov Chains – Gambler Example

Gambler starts with 10$

At each play we have one of the following: Gambler wins 1$ with probability p Gambler looses 1$ with probability 1-p

Game ends when gambler goes broke, or gains a fortune of 100$

0 1 2 99 100

p p p p

1-p 1-p 1-p 1-pStart (10$)

11

Properties of Markov Chain States

States of Markov chains are classified by the digraph representation (omitting the actual probability values)

Recurrent states: s is recurrent if it is accessible from

all states that are accessible from s. C and D are recurrent states.

Transient states: “s is transient” if it will be visited

a finite number of times as n. A and B are transient states.

A B

C D

12

Irreducible Markov Chains

A Markov Chain is irreducible if the corresponding graph is strongly connected (and thus all its states are recurrent).

A B

C D

A B

C D

E

13

Properties of Markov Chain states

A state s has a period k if k is the GCD of the lengths of all the cycles that pass via s.

Periodic states A state is periodic if it has a period k>1.

in the shown graph the period of A is 2.

Aperiodic states A state is aperiodic if it has

a period k=1. in the shown graph the period of F is 1.

A B

C D

E

F

14

Ergodic Markov Chains

A Markov chain is ergodic if: the corresponding graph is irreducible. It is not peridoic

Ergodic Markov Chains are important they guarantee the corresponding Markovian process converges

to a unique distribution, in which all states have strictly positive probability.

A B

C D

15

Stationary Distributions for Markov Chains

Let M be a Markov Chain of m states, and let V=(v1,…, vm) be a probability distribution over the m states V=(v1,…, vm) is stationary distribution for M if VM=V.

one step of the process does not change the distribution

V is a stationary distribution

V is a left (row) Eigenvector of M with Eigenvalue 1

16

“Good” Markov chains

A Markov Chains is good if the distributions Xi satisfy the following as i: converge to a unique distribution, independent of the initial

distribution In that unique distribution, each state has a positive probability

The Fundamental Theorem of Finite Markov Chains: A Markov Chain is good the corresponding graph is ergodic.

17

“Bad” Markov Chains

A Markov chains is not “good” if either : It does not converge to a unique distribution It does converge to a unique distribution, but some states in this

distribution have zero probability

For instance: Chains with periodic states Chains with transient states

18

An Example: Searching the Genome for CpG Islands

In the human genome, the pair CG appears less than expected the pair CG often transforms to (methyl-C) G which often transforms

to TG. Hence the pair CG appears less than expected from independent

frequencies of C and G alone.

Due to biological reasons, this process is sometimes suppressed in short stretches of genome such as in the start regions of many genes.

These areas are called CpG islands (p denotes “pair”).

19

CpG Islands

We consider two questions (and some variants): Question 1: Given a short stretch of genomic data, does it come

from a CpG island ? Question 2: Given a long piece of genomic data, does it contain

CpG islands in it, where, what length?

We “solve” the first question by modeling strings with and without CpG islands as Markov Chains States are {A,C,G,T} but Transition probabilities are different

20

CpG Islands

The “+” model Use transition matrix A+=(a+

st), Where:

a+st = (the probability that t follows s in a CpG island)

The “-” model Use transition matrix A-=(a-

st), Where:

A-st = (the probability that t follows s in a non CpG island)

21

CpG Islands

To solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “–” model.

This is done by using the definitions of Markov Chain, in which the parameters are determined by known data and the log odds-ratio test.

22

CpG Islands – the “+” Model

We need to specify p+(xi|xi-1) where + stands for CpG Island. From Durbin et al we have:

A C G T

A 0.180 0.274 0.426 0.120

C 0.171 0.368 0.274 0.188

G 0.161 0.339 0.375 0.125

T 0.079 0.355 0.384 0.182

23

CpG Islands – the “-” Model

p-(xi|xi-1) for non-CpG Island is given by:

A C G T

A 0.300 0.205 0.285 0.210

C 0.322 0.298 0.078 0.302

G 0.248 0.246 0.298 0.208

T 0.177 0.239 0.292 0.292

24

CpG Islands

Given a string X=(x1,…, xL), now compute the ratio

RATIO>1 CpG island is more likely RATIO<1 non-CpG island is more likely.

X1 X2 XL-1 XL

1

01

1

01

model) (

model) (RATIO

L

iii

L

iii

xxp

xxp

p

p

)|(

)|(

|

|

x

x

Documents

1 Markov Chains Algorithms in Computational Biology Spring 2006 Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler