Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local...

Preview:

Citation preview

Intro to Alignment Intro to Alignment Algorithms: Algorithms:

Global and LocalGlobal and Local

Algorithmic Functions of Computational Biology

Professor Istrail

Sequence ComparisonBiomolecular sequences DNA sequences (string over 4 letter alphabet {A, C,

G, T}) RNA sequences (string over 4 letter alphabet

{ACGU}) Protein sequences (string over 20 letter alphabet

{Amino Acids})

Sequence similarity helps in the discovery of genes, and the prediction of structure and function of proteins.

Algorithmic Functions of Computational Biology

Professor Istrail

The Basic Similarity Analysis Algorithm

Global Similarity

• Scoring Schemes

• Edit Graphs

• Alignment = Path in the Edit Graph

• The Principle of Optimality

• The Dynamic Programming Algorithm

• The Traceback

Algorithmic Functions of Computational Biology –

Professor Istrail

Jupiter’s code: Alignment

MassaMaster

Mas-saMaster

Mass-aMaster

Massa-Master

Algorithmic Functions of Computational Biology

Professor Istrail

Sequence AlignmentInput: two sequences over the same alphabetOutput: an alignment of the two sequences

Example: GCGCATTTGAGCGA TGCGTTAGGGTGACCAA possible alignment: - GCGCATTTGAGCGA - -

TGCG - - TTAGGGTGACC

matchmismatch

indel

Algorithmic Functions of Computational Biology –

Professor Istrail

m

n

yyyY

xxxX

...21

...21

Consider two sequences

Over the alphabet

},,, TGCA

ji yx , belong to

Algorithmic Functions of Computational Biology –

Professor Istrail

Scoring Schemes

Unit-score A C G T

A

C

G

T

1

1

1

1-

-

0

0

00

0

0

0

0 00

0

0

00

0

0

00

0

00

Algorithmic Functions of Computational Biology –

Professor Istrail

Alignment

ACG| | |AGG

Score = (A,A) (C,G) (G,G)

+ +

= 1 + 0 + 1 = 2

Unit-cost

A|A

A is aligned with A

C|G

C is aligned with G

G is aligned with G

G|G

Algorithmic Functions of Computational Biology –

Professor Istrail

Gaps

ACATGGAATACAGGAAAT

ACAT GG - AAT ACA - GG AAAT

OPTIMALALIGNMENTS

SCORE 7 8

AAAGGGGGGAAA

SCORE 0 3

- - - AAAGGG GGGAAA - - -

“-” is the gap symbol

Algorithmic Functions of Computational Biology –

Professor Istrail

(x,y) = the score for aligning x with y

(-,y) = the score for aligning - with y

(x,-) = the score for aligning x with -

Algorithmic Functions of Computational Biology –

Professor Istrail

A-CG - GATCGTG

Alignment

Score

(A,A) + (G,G) +(C,C) +(-,T) + (-,T ) + (G,G)

THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED SYMBOLS

Algorithmic Functions of Computational Biology –

Professor Istrail

Scoring Scheme

Dayhoff score

...

PTIPLSRLFDNAMLRAHRLHQSAIENQRLFNIAVSRVQHLHL

Partial alignment for Monkey and Trout somatotropin proteins

- A R N D C Q E G H I L K M F P S T W Y V

-8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8-8 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0A

RND

64

Algorithmic Functions of Computational Biology –

Professor Istrail

Scoring Functions

Scoring function = a sum of a terms each for a pair of aligned residues, and for each gap

The meaning = log of the relative likelihood that the sequences are related, compared to being unrelated

Identities and conservative substitutions are Positive terms

Non-conservative substitutions are Negative terms

Algorithmic Functions of Computational Biology –

Professor Istrail

The Edit Graph

Suppose that we want to align AGT with AT

We are going to construct a graph where alignments between the two sequences correspond to paths between the begin and and end nodes of the graph.

This is the Edit Graph

Algorithmic Functions of Computational Biology –

Professor Istrail

0 1 2 3

0

1

2

AGT has length 3AT has length 2

The Edit graph has (3+1)*(2+1) nodes

The sequence AGT

The sequence AT

Algorithmic Functions of Computational Biology –

Professor Istrail

0 1 2 30

1

2

A G T

A

T

AGT indexes the columns, and AT indexes the rows of this “table”

Algorithmic Functions of Computational Biology –

Professor Istrail

0 1 2 30

1

2

A G T

A

T

The Graph is directed. The nodes (i,j) will hold values.

Algorithmic Functions of Computational Biology –

Professor Istrail

T0 1 2 30

1

2

A G

A

T

Algorithmic Functions of Computational Biology

Professor Istrail

T

A

T

0 1 2 30

1

2

A GA-

-A

AA

-A

-A-

A

-A

G-

A-

A-

G-

G-

T-

T-

T-

-T

-T

-T

-T

AT

GT

TT

GA

TA

Directed edges get as labels pairs of aligned letters.

Algorithmic Functions of Computational Biology –

Professor Istrail

Alignment = Path in the Edit GraphT

A

T

0 1 2 30

1

2

A GA-

-A

AA

-A

-A-

A

-A

G-

A-

A-

G-

G-

T-

T-

T-

-T

-T

-T

-T

AT

GT

TT

GA

TA

AGTA-T

Every path from Begin to End corresponds to an alignment

Every alignment corresponds to a path between Begin and End

Algorithmic Functions of Computational Biology –

Professor Istrail

The Principle of Optimality

The optimal answer to a problem is expressed in terms of optimal answer for its sub-problems

Algorithmic Functions of Computational Biology –

Professor Istrail

Dynamic Programming

Part 1: Compute first the optimal alignment score

Part 2: Construct optimal alignment

We are looking for the optimal alignment = maximal score path in the Edit Graph from the Begin vertex to the End vertex

Given: Two sequences X and YFind: An optimal alignment of X with Y

Algorithmic Functions of Computational Biology –

Professor Istrail

The DP Matrix S(i,j)

0 1 2 30

1

2

A G T

A

TS(2,1)

S(1,0)

Algorithmic Functions of Computational Biology –

Professor Istrail

The DP MatrixMatrix S =[S(i,j)]

S(i,j) = The score of the maximal cost path from the Begin Vertex and the vertex (i,j)

(i,j)

(i,j-1)

(i-1,j)

(i-1,j-1)The optimal path to (i,j)must pass through one of

the vertices

(i-1,j-1)

(i-1,j)

(i,j-1)

Algorithmic Functions of Computational Biology –

Professor Istrail

Opt path

(i,j)

(i,j-1)

(i-1,j)

(i-1,j-1)

Optimal path to (i-1,j) + (- , yj)

-xi

yj-

S(i-1,j) +

(- , yj)

Algorithmic Functions of Computational Biology –

Professor Istrail

Optimal path

(i,j)

(i-1,j)

(i,j-1)

(i-1,j-1)

Optimal path to (i-1,j-1) + (xi,yj)

S(i-1,j-1) + (xi , yj)

Algorithmic Functions of Computational Biology –

Professor Istrail

Optimal path

(i,j)

(i,j-1)

(I-1,j)

(i-1,j-1)

Optimal path to (i,j-1) + (xi,-)

S(i,j-1) + (xi, -)

Algorithmic Functions of Computational Biology –

Professor Istrail

The Basic ALGORITHM

S(i,j) =

S(i-1, j-1) + (xi, yj)

S(i-1, j) + (xi, -)

S(i, j-1) + (-, yj)

MAX

Algorithmic Functions of Computational Biology –

Professor Istrail

T

A

T

0 1 2 30

1

2

A GA-

-A

AA

-A

-A-

A

-A

G-

A-

A-

G-

G-

T-

T-

T-

-T

-T

-T

-T

AT

GT

TT

GA

TA

0

0

0

0

1

1

0

1

1

0

1

2AGTA - TOptimal Alignment

Optimal Alignment and TracbackAlgorithmic Functions of Computational Biology –

Professor Istrail

S(i,j) =

S(i-1, j-1) + (xi, yj),

S(i-1, j) + (xi, -),

S(i, j-1) + (-, yj)

MAX

0, We add this

The Basic ALGORITHM: Local Similarity

Algorithmic Functions of Computational Biology –

Professor Istrail

General Scoring Schemes

1. Independence of mutations at different sites

Additive scoring scheme

2. Gaps of any length are considered one mutation

All of the efficient alignment algorithms -- employing on the dynamicprogramming method --are based fundamentally on the of the fact that the scoring function is additive.

Assumptions

Algorithmic Functions of Computational Biology –

Professor Istrail

Substitutions Matrices

m

n

yyyY

xxxX

...21

...21

ji yx ,belong to

Consider ungapped alignment of equal length sequences

Compute the probability that the two sequences are related

Compute the probability that the two sequences are not related

Compute the ratio of the two probabilities

Algorithmic Functions of Computational Biology

Professor Istrail

Random Model R

Every letter z occurs independently with probability

yjx qqRyxP i)/,(

qz

Algorithmic Functions of Computational Biology - Course 3

Professor Istrail

Match Model M

Aligned pairs of residues occur with joint probability ab

pab

jiyxpMyxP )/,(

Algorithmic Functions of Computational Biology - Course 3

Professor Istrail

)/,(

)/,(

RyxP

MyxP)/ jxixiyj qqp

),( ii yxs

)/log(),( baab pppbas

= log

=

i

where

Log-odds ratio

s(a,b) = the substitution matrixAlgorithmic Functions of Computational Biology - Course 3

Professor Istrail

Recommended