Constructing Probability Matrices Redux Suppose we live in a world with only 3 amino acids: Alanine Leucine Serine Furthermore suppose: Alanine Leucine with probability .2 Alanine Serine with probability .1 Leucine Serine with probability .3


Constructing Probability Matrices Redux

Suppose we live in a world with only 3 amino acids:

Alanine

Leucine

Serine

Furthermore suppose:

Alanine ↔ Leucine with probability .2

Alanine ↔ Serine with probability .1

Leucine ↔ Serine with probability .3

We will assume that these probabilities are for changes that take place during one time unit.


We can summarize these observations using the language of probability theory. We will use the notation (A|L, t) to mean: “A certain position in our sequence initially contains Leucine and at time t it contains Alanine.” Another way of saying this is, “After t time units the position contains Alanine given that it initially contained Leucine.” That is, the vertical bar means “given”: Alanine given Leucine after t time units.

We then write:

Pr(A|A, 1) = .7 Pr(A|L, 1) = .2 Pr(A|S, 1) = .1

Pr(L|A, 1) = .2 Pr(L|L, 1) = .5 Pr(L|S, 1) = .3

Pr(S|A, 1) = .1 Pr(S|L, 1) = .3 Pr(S|S, 1) = .6

The above can be summarized in a table, called a matrix:

1\2    A     L     S
A     .7    .2    .1
L     .2    .5    .3
S     .1    .3    .6

$$M = \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix}$$


What about the probabilities two time units later? For example, what is the probability that a position that was originally Alanine is Alanine two time units later?

This can happen in three ways:

A → A → A

A → L → A

A → S → A

In our original notation, we are saying:

(A|A, 2) = [(A|A, 1) and (A|A, 1)] or [(L|A, 1) and (A|L, 1)] or [(S|A, 1) and (A|S, 1)]

Thus, to compute the probability,

Pr(A|A,2) = Pr(A|A,1)Pr(A|A,1) + Pr(L|A,1)Pr(A|L,1) + Pr(S|A,1)Pr(A|S,1)

= .7*.7 + .2*.2 + .1*.1 = .49 + .04 +.01 = .54
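The sum over the three intermediate states can be checked with a few lines of Python. This is only a sketch: the names `one_step`, `STATES`, and `two_step` are illustrative and not part of the slides. Entries are keyed as (final, initial) to mirror the Pr(final|initial, 1) notation.

```python
# One-step probabilities keyed as (final, initial), mirroring Pr(final|initial, 1).
one_step = {
    ("A", "A"): 0.7, ("A", "L"): 0.2, ("A", "S"): 0.1,
    ("L", "A"): 0.2, ("L", "L"): 0.5, ("L", "S"): 0.3,
    ("S", "A"): 0.1, ("S", "L"): 0.3, ("S", "S"): 0.6,
}
STATES = ("A", "L", "S")

def two_step(final, initial):
    """Pr(final|initial, 2): sum over every possible state at time 1."""
    return sum(one_step[(mid, initial)] * one_step[(final, mid)] for mid in STATES)

print(round(two_step("A", "A"), 2))  # 0.54
```

Calling `two_step` with any other pair of states gives one of the 8 remaining two-time-unit probabilities.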

We will work out the 8 other two-time-unit transition probabilities in class.


After we compute all 9 of the probabilities for the transitions after 2 time units, we have the following table.

       A     L     S
A     .54   .27   .19
L     .27   .38   .35
S     .19   .35   .46

Computing the value placed in each of the nine cells required three multiplications and two additions. That is, there were 27 multiplications and 18 additions required to produce the above table.
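The table and the operation count can be reproduced with a short Python sketch. The function name `two_step_table` and the counters are illustrative choices, not something taken from the slides.

```python
M = [
    [0.7, 0.2, 0.1],  # row A
    [0.2, 0.5, 0.3],  # row L
    [0.1, 0.3, 0.6],  # row S
]

def two_step_table(m):
    """Return (table, multiplications, additions) for the two-time-unit table."""
    n = len(m)
    mults = adds = 0
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # n products per cell ...
            products = [m[i][k] * m[k][j] for k in range(n)]
            mults += n
            # ... combined with n - 1 additions
            total = products[0]
            for p in products[1:]:
                total += p
                adds += 1
            table[i][j] = round(total, 2)
    return table, mults, adds

table, mults, adds = two_step_table(M)
for row in table:
    print(row)
print(mults, adds)  # 27 18
```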


The Matrix Connection

Consider the matrix, M, that we constructed earlier when we made the table of probabilities:

$$M = \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix}$$

In matrix algebra, the product of two matrices is defined as follows:

To compute the product of two matrices A and B, the value placed in row, i, and column, j, is obtained by multiplying each value in row, i, of A by its corresponding element in column, j, of B and summing the results.
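In symbols, the same rule can be written as follows. The index k and the size n are introduced here only for the formula; n is assumed to be both the number of columns of A and the number of rows of B.

```latex
(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}
```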

Translation by way of an illustration to follow.


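Let’s suppose we want to square M, i.e. multiply M by itself:

$$M \cdot M = \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix} \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix}$$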

To compute the value of the product matrix M² in row 2, column 3, we multiply each element in row 2 of the first matrix by its corresponding element in column 3 of the second matrix and sum the results:

.2*.1 + .5*.3 + .3*.6 = .02 +.15 + .18 = .35

But this is exactly how we calculated Pr(S|L, 2)! This agreement between M² and the table of transition probabilities holds for each position.

It appears that Matrix Multiplication is exactly what we need to generate the table of transition probabilities after t time units.
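As a minimal sketch of that observation, the following Python code builds the table for t time units by multiplying M by itself repeatedly. The helper names `mat_mul` and `transition_table` are made up for this illustration.

```python
def mat_mul(a, b):
    """Row-by-column product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_table(m, t):
    """Transition probabilities after t time units (t >= 1): m times itself t - 1 times."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result

M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

M2 = transition_table(M, 2)
print(round(M2[1][2], 2))  # row 2, column 3 -> 0.35

M3 = transition_table(M, 3)
print([round(sum(row), 6) for row in M3])  # each row still sums to 1
```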


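Thus, if we use the rules of matrix multiplication,

$$M^2 = \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix} \begin{pmatrix} .7 & .2 & .1 \\ .2 & .5 & .3 \\ .1 & .3 & .6 \end{pmatrix} = \begin{pmatrix} .54 & .27 & .19 \\ .27 & .38 & .35 \\ .19 & .35 & .46 \end{pmatrix}$$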

Since the rules of matrix multiplication and those for computing the transition probabilities are essentially the same, we have a marriage made by the divine. So let’s use them to our advantage.


Recall, the PAM1 probability matrix from last period.

This matrix is not symmetric (that is, the values mirrored across the diagonal are not all equal), but the rules for transition probabilities and matrix multiplication are the same. Therefore we can apply our previous observations.


This matrix has 20 rows and 20 columns. Multiplying it by itself would require 20 multiplications and 19 additions to compute the value in each of its 400 positions. That is:

400*(20 +19) = 8000 +7600 = 15,600 operations

That just gets us to the PAM2 matrix.

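Most applications use the PAM250 matrix. Reaching it one step at a time takes about 250 successive matrix multiplications (249, to be exact), which means roughly:

250*15,600 = 3,900,000 operations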

This is a hefty load even with a computer to say nothing of lost accuracy due to computer word size limitations.

Fortunately, matrix algebra has ways of cutting way down on the number of operations.

To learn more, see your local friendly neighborhood mathematician.
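One standard way to cut the count is repeated squaring, offered here as an illustration rather than as the method the slides have in mind. It builds M, M², M⁴, M⁸, ... and multiplies together only the powers needed for the exponent. The sketch below uses the 3-by-3 matrix from earlier as a stand-in, since the PAM1 values are not reproduced in this text. `mat_pow` is a hypothetical helper name.

```python
def mat_mul(a, b):
    """Row-by-column product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """Return (m to the power t, number of matrix products used), for t >= 1."""
    result = None   # product of the powers selected so far
    power = m       # m, m^2, m^4, m^8, ...
    count = 0
    while t > 0:
        if t & 1:                       # this power of two is part of t
            if result is None:
                result = power
            else:
                result = mat_mul(result, power)
                count += 1
        t >>= 1
        if t > 0:                       # square only if a higher bit remains
            power = mat_mul(power, power)
            count += 1
    return result, count

M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

_, count = mat_pow(M, 250)
# 15,600 is the per-product cost quoted above for a 20-by-20 matrix.
print(count, count * 15_600)  # 12 187200
```

Under these assumptions, 12 matrix products replace the 249 needed when stepping one time unit at a time.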


Back to the Topic of Sequence Alignments


We now leave the realm of exact alignments and adopt some heuristics. A heuristic is an exploratory problem-solving method based on experience and relying on past results for improvement of the technique. NOTE: a heuristic is not an algorithm that has been proven to be correct.

BLAST - Basic Local Alignment Search Tool

Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman

Originally for ungapped local comparison of sequences. It has since been expanded to involve comparisons of gapped sequences.

There have been several extensions of the technique and improvements to the basic tool throughout the 14 years of its life thus far.


Needleman-Wunsch, SemiGlobal Alignment, and Smith-Waterman assume we know which two sequences we need to compare.

BLAST is designed to do a database search for possible matching sequences:

1. There is no known starting point to begin the matches

2. There is not a well established format for the information stored in the data

3. It is like searching for a file in a cluttered office – see Professor Leinbach’s or Professor James’ offices for reference.

The amazing thing is that BLAST has been so successful!


Consider this sequence

gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa

We want to search for possible matches to gain clues to its identity

1. Place a sliding window 11 or 12 nucleotides long over the sequence

gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa

2. Extract the window subsequence and compress it to 3 bytes

Code a as 00, c as 01, g as 10, and t as 11 (base 2)

Thus, the 11 characters take up 22 bits – 3 bytes with two bits unused

3. Using a Finite State Automaton, eliminate subsequences that occur with a very high frequency in the database

4. If the subsequence survives, i.e. is determined to be relatively rare, use a hash table to locate sequences in the database that contain that 11 nucleotide subsequence.

5. Extend the match in both directions, scoring the extension, until the score drops below a predetermined threshold. If it survives for the length of the original sequence, report the result.

6. Slide the window down one character and repeat steps 2 - 6

The above is only an approximate description of the BLAST algorithm.
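Steps 1, 2, 4, and 6 can be sketched in Python as shown below. This is a toy illustration under stated assumptions, not BLAST itself: the frequency filter of step 3 and the extension of step 5 are left out, the one-entry `database` is made up for the example, and the names `pack`, `build_index`, and `seeds` are hypothetical.

```python
CODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}  # 2-bit codes from step 2
W = 11  # window length from step 1

def pack(word):
    """Pack a nucleotide word into an integer, 2 bits per character."""
    value = 0
    for ch in word:
        value = (value << 2) | CODE[ch]
    return value

def build_index(database):
    """Hash table: packed 11-mer -> list of (sequence id, offset)."""
    index = {}
    for seq_id, seq in database.items():
        for i in range(len(seq) - W + 1):
            index.setdefault(pack(seq[i:i + W]), []).append((seq_id, i))
    return index

def seeds(query, index):
    """Slide the window along the query; yield (query offset, seq id, db offset)."""
    for q in range(len(query) - W + 1):
        for seq_id, d in index.get(pack(query[q:q + W]), []):
            yield q, seq_id, d

query = "gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa"

# Made-up subject: a slice of the query with a few extra characters on each side.
database = {"toy_subject": "ccc" + query[20:45] + "ggg"}

packed = pack(query[:W])
print(packed.bit_length(), len(packed.to_bytes(3, "big")))  # 22 3

hits = list(seeds(query, build_index(database)))
print((20, "toy_subject", 3) in hits)  # True
```

Each hit is a candidate starting point that step 5 would then try to extend in both directions.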


Here is one of the BLAST results for our sequence:

Score = 44.1 bits (22), Expect = 0.035
Identities = 34/38 (89%), Gaps = 0/38 (0%)
Strand = Plus/Plus

Query  12    GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC  49
             ||||| || ||||||||| | |||||||||||||||||
Sbjct  5030  GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC  5067

With a report of:

gi|71835970|gb|DQ117988.1| Physcomitrella patens DNA mismatch repair protein MSH2 gene

The result along with 68 others was reported in about 30 seconds of searching on a Friday afternoon before a holiday. The search involved over 3 million subject sequences accounting for over 16 billion characters!
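The identity count in this report can be checked directly from the two aligned strings. The sketch below also computes a raw score under an assumed scheme of +1 per match and -3 per mismatch. That scheme is not stated on the slide, but it does reproduce the 22 shown in parentheses after the bit score.

```python
query = "GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC"
sbjct = "GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC"

matches = sum(q == s for q, s in zip(query, sbjct))
length = len(query)

# The line of bars printed between the two sequences in the report.
midline = "".join("|" if q == s else " " for q, s in zip(query, sbjct))

# Assumed scoring (not given on the slide): +1 per match, -3 per mismatch.
raw_score = matches * 1 + (length - matches) * -3

print(f"Identities = {matches}/{length} ({matches / length:.0%})")  # Identities = 34/38 (89%)
print(midline)
print(raw_score)  # 22
```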


For protein sequences the window is 4 amino acids long:

1 Amino Acid = 3 Nucleotides

4 Amino Acids = 12 Nucleotides = 3 Bytes coded (12 nucleotides at 2 bits each is 24 bits)

From Krane and Raymer page 49: