Class 3: Sequence similarity
Motivation
• Same gene, or similar gene
• Suffix of A similar to prefix of B?
• Suffix of A similar to prefix of B..Z?
• Longest similar substring of A, B
• Longest similar substring of A, B..Z
• For each, How big? How similar?
Define alignment
• Align these two sequences optimally:
GACGGATT
GATCGGTT
• Define precisely what an alignment is
Definition of alignment
• Insert spaces so that the letters line up, or letters align with spaces
GA-CGGATT
GATCGG-TT
• Don’t allow spaces to line up
• Allow spaces even at beginning and end
GCAT-
-CATG
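The three rules above can be checked mechanically. The following is a minimal sketch, assuming "-" as the space character; the function name is_alignment is made up for illustration.

```python
def is_alignment(top, bottom, s, t, space="-"):
    """Return True if (top, bottom) is a valid alignment of s and t."""
    # Both rows must have the same number of columns.
    if len(top) != len(bottom):
        return False
    # No column may pair a space with a space.
    if any(x == space and y == space for x, y in zip(top, bottom)):
        return False
    # Removing the spaces must give back the original sequences.
    return top.replace(space, "") == s and bottom.replace(space, "") == t


print(is_alignment("GA-CGGATT", "GATCGG-TT", "GACGGATT", "GATCGGTT"))
print(is_alignment("GCAT-", "-CATG", "GCAT", "CATG"))
```

Both calls use the alignments shown on the slides, including the one with spaces at the beginning and end.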
Define similarity
• Given an alignment, compute a similarity score
• Three possibilities for each column
letter-letter match
letter-letter mismatch
letter-space mismatch
Optimal alignment
• Create score function
• Conventionally:
+1 bonus for match
-1 penalty for letter-letter mismatch
-2 penalty for letter-space mismatch
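As a sketch of how the conventional score function applies to a given alignment, the code below sums the column scores. The name score_alignment and the "-" space character are assumptions for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("rows of an alignment must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += space       # letter-space column
        elif x == y:
            total += match       # letter-letter match
        else:
            total += mismatch    # letter-letter mismatch
    return total


print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces -> 3
```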
Dynamic programming solution
• Given sequences s,t of length m,n
• Strategy: build up optimal alignment of prefixes
• Base case?
• Recurrence relation?
Recurrence
• Given optimal alignments of all shorter prefix pairs of s, t, find the optimal alignment of s[1..i], t[1..j]
• Three possibilities:
– extend s by a letter, t by a space
– extend s by a letter, t by a letter
– extend s by a space, t by a letter
Tiny instance -- AGC, AAAC
         A   A   A   C
     0  -2  -4  -6  -8
A   -2
G   -4
C   -6
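The rest of the array can be filled row by row. Below is a minimal sketch, assuming the +1 / -1 / -2 scores from the earlier slide; the function name fill_table is made up for illustration. Entry a[i][j] holds the best score for aligning s[1..i] with t[1..j].

```python
def fill_table(s, t, match=1, mismatch=-1, gap=-2):
    """Build the (len(s)+1) x (len(t)+1) array of optimal prefix scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base case: a prefix aligned against the empty string is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + gap,    # letter of s over a space
                a[i - 1][j - 1] + p,  # letter of s over a letter of t
                a[i][j - 1] + gap,    # space over a letter of t
            )
    return a


for row in fill_table("AGC", "AAAC"):
    print(" ".join(f"{v:3d}" for v in row))
```

For AGC and AAAC the bottom-right entry, the score of an optimal alignment, comes out as -1.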
Some dp details
• What is a good order to fill the array?
• How do you recover the opt alignment?
• What do you do about ties?
• What is the space complexity of this algorithm?
• What is the time complexity of this algorithm?
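One possible answer to the recovery and tie questions is sketched below: fill the array row by row, then walk back from the bottom-right corner, re-checking which of the three cases produced each entry. Ties are broken here by preferring the letter-letter case, which is an arbitrary choice; the name global_align is an assumption.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, top_row, bottom_row) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap
    for j in range(1, n + 1):
        a[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)

    # Traceback: rebuild the alignment from the last column to the first.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("AGC", "AAAC"))
```

This version keeps the whole array, so it uses O(mn) space as well as O(mn) time; the traceback itself takes at most m + n steps.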
The gap penalty
• Model above assumes two gaps of size 1 are equivalent to one gap of size 2
• Is this realistic? Why or why not?
General gap penalties
• Alignments can no longer be scored as the sum of their parts
• They still are the sum of blocks with one matched letter or one gap each
• Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
DP for general gaps
• Requires three arrays, one for each block type
• Time complexity is cubic
• This is expensive at best, prohibitive for large problems
• See Setubal/Meidanis 3.3.2 for details
Affine gap penalty
• Charge h for opening each gap, plus g per space: a gap of length k costs h + g * k
• This still has quadratic complexity!
• See Setubal/Meidanis
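A sketch of a quadratic-time affine-gap DP is given below. It keeps three arrays indexed by how the alignment of the two prefixes ends; the default values h=2 and g=1 and the name affine_score are assumptions for illustration, not values from the slides.

```python
NEG_INF = float("-inf")


def affine_score(s, t, h=2, g=1, match=1, mismatch=-1):
    """Best global score when a gap of length k costs h + g * k."""
    m, n = len(s), len(t)
    # a: last column is letter over letter
    # b: last column is a space over a letter of t
    # c: last column is a letter of s over a space
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Extending a gap of the same kind costs g; starting one costs h + g.
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


print(affine_score("ACGT", "AT"))            # one gap of length 2 in the middle
print(affine_score("AGC", "AAAC", h=0, g=2))  # h = 0 reduces to the linear model
```

Each cell does a constant amount of work in each of the three arrays, which is why the total stays O(mn).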
Point accepted mutations
• Some mutations are more likely than others
• In proteins, some amino acids are more similar than others (size, charge, hydrophobicity)
• A point accepted mutation matrix is a table with the probability of each transition in a fixed time
PAM matrices
• Each row sums to 1: for a fixed a, the entries Mab are the probabilities of a becoming each b (including staying a)
• A ‘unit of evolution’ (1 PAM) is the time in which 1 in 100 amino acids is expected to change
Scoring matrix
• Consider aligned letters a,b
• Pr(b is a mutation of a) = Mab
• Pr(b is a random occurrence) = pb
• Score(a,b) = 10log(Mab / pb)
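The score formula can be tried on numbers directly. The sketch below assumes the logarithm is base 10, and the probabilities passed in are made-up toy values, not entries of a real PAM matrix.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score(a,b) = 10 * log10(Mab / pb); the base-10 log is an assumption."""
    return 10 * math.log10(m_ab / p_b)


# Toy values for illustration only.
print(log_odds_score(0.10, 0.01))  # b follows a far more often than chance: positive
print(log_odds_score(0.05, 0.05))  # no more often than chance: zero
print(log_odds_score(0.01, 0.10))  # less often than chance: negative
```

Because the score is a logarithm of a ratio, adding the scores of aligned columns corresponds to multiplying the ratios, which is what makes a column-by-column sum meaningful.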
Blast
• Basic Local Alignment Search Tool
• Def: ‘segment’ is a contiguous substring (without gaps)
• Def: ‘segment pair’ is two segments of equal length
• Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs
What Blast does
• Input:
– a PAM matrix
– a database of sequences B
– a query sequence A
– a threshold S
• Output:
– all segment pairs (A, B) with score > S
How Blast works
• Compile short, high-scoring strings (words)
• Search for hits -- each hit gives a seed
• Extend seeds
Blast on proteins
• Words are w-mers which score at least T against A
• Use hashing or a DFA (deterministic finite automaton) to search for hits
• Extend seed until heuristically determined limit is reached
Blast on nucleic acids
• Words are w-mers in query A
• Letters compressed, four to a byte (2 bits per nucleotide)
• Filter database B for very common words to avoid false positives
• Extend seeds as in proteins
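The word, hit and extend steps for nucleic acids can be sketched as follows. This is a simplified illustration, not Blast's actual implementation: it uses a plain dictionary of w-mers, +1/-1 letter scores, and a fixed "drop" limit as the stopping heuristic. All names and default parameters are assumptions.

```python
from collections import defaultdict


def index_words(query, w):
    """Map each w-mer of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend(a, b, i, j, step, match, mismatch, drop):
    """Extend without gaps from a[i], b[j] in direction step (+1 or -1).

    Stops once the running score falls more than `drop` below the best seen.
    Returns (best score gained, number of letters kept).
    """
    best = score = 0
    best_len = k = 0
    while 0 <= i < len(a) and 0 <= j < len(b) and best - score <= drop:
        score += match if a[i] == b[j] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        i += step
        j += step
    return best, best_len


def seed_and_extend(query, subject, w=4, threshold=5, drop=3, match=1, mismatch=-1):
    """Return sorted (query_start, subject_start, length, score) with score > threshold."""
    index = index_words(query, w)
    found = set()
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            right, right_len = extend(query, subject, i + w, j + w, 1, match, mismatch, drop)
            left, left_len = extend(query, subject, i - 1, j - 1, -1, match, mismatch, drop)
            score = w * match + left + right
            if score > threshold:
                found.add((i - left_len, j - left_len, w + left_len + right_len, score))
    return sorted(found)


print(seed_and_extend("ACGTACGT", "TTACGTACGTTT"))
```

Several seeds on the same diagonal extend to the same segment pair, so the results are collected in a set to report it once.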
What does Blast give you?
• Efficiency
• A rigorous statistical theory which gives the probability of a segment pair occurring by chance
Homework
• Given sequences s,t of length m,n, how many alignments do they have?
• Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.