Computational Biology, Part 7
Similarity Functions and Sequence Comparison with Dot Matrices
Robert F. Murphy
Copyright 1996, 1999-2001. All rights reserved.
Similarity Functions
  Used to facilitate comparison of two sequence elements
  Logical valued (true or false, 1 or 0)
    Test whether first argument matches (or could match) second argument
  Numerical valued
    Test degree to which first argument matches second
Logical valued similarity functions
  Let Search(I) = 'A' and Sequence(J) = 'R'
  A function to test for exact match
    MatchExact(Search(I),Sequence(J)) would return FALSE, since A is not R
  A function to test for possibility of a match using IUB codes for incompletely specified bases
    MatchWild(Search(I),Sequence(J)) would return TRUE, since R can be either A or G
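The two logical-valued functions above can be sketched in a few lines of Python. This is only an illustration: the names `match_exact`, `match_wild` and `IUB_CODES` are assumptions chosen to mirror MatchExact and MatchWild, and treating U as T is a simplification made here so that RNA input also works.

```python
# IUB nucleotide codes mapped to the bases each code can stand for.
# Mapping U to T is an assumption made here for convenience.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_exact(a, b):
    """True only if the two characters are the same symbol."""
    return a.upper() == b.upper()

def match_wild(a, b):
    """True if the two codes share at least one possible base."""
    bases_a = set(IUB_CODES[a.upper()])
    bases_b = set(IUB_CODES[b.upper()])
    return bool(bases_a & bases_b)

print(match_exact("A", "R"))  # False: A is not R
print(match_wild("A", "R"))   # True: R can be either A or G
```

Representing each code as a set of bases makes the wildcard test a set intersection, so two ambiguous codes (for example R and N) can also be compared with each other.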
Numerical valued similarity functions
  Return value could be probability (for DNA)
    Let Search(I) = 'A' and Sequence(J) = 'R'
    SimilarNuc(Search(I),Sequence(J)) could return 0.5, since chances are 1 out of 2 that a purine is adenine
  Return value could be similarity (for protein)
    Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
    SimilarProt(Seq1(I),Seq2(J)) could return 0.8, since lysine is similar to arginine
  Usually use integer values for efficiency
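A minimal sketch of the probability-valued case, assuming that every base allowed by an ambiguity code is equally likely (the slide's "1 out of 2" reasoning). The name `similar_nuc` is hypothetical and stands in for SimilarNuc; the first argument is assumed to be an unambiguous base.

```python
# IUB nucleotide codes mapped to the bases each code can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def similar_nuc(search_base, sequence_code):
    """Probability that sequence_code is search_base.

    Assumes each base allowed by the code is equally likely.
    """
    allowed = IUB_CODES[sequence_code.upper()]
    if search_base.upper() in allowed:
        return 1.0 / len(allowed)
    return 0.0

print(similar_nuc("A", "R"))  # 0.5: a purine is A or G
print(similar_nuc("A", "N"))  # 0.25: any of four bases
```

To follow the advice about integer values, the result could be scaled and rounded (for example to a percentage) before being stored in a scoring table.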
Scoring (similarity) matrices
  For each pair of characters in the alphabet, the value is proportional to the degree of similarity (or other scoring criterion) between them
  For proteins, the most frequently used is the Mutation Data Matrix from Dayhoff, 1978 (MDM78)
Dayhoff PAM250 similarity matrix (partial)
     A    B    C    D    E    F    G    H
A    2    0   -2    0    0   -4    1   -1
B    0    0   -4    3    2   -5    0    1
C   -2   -4   12   -5   -5   -4   -3   -3
D    0    3   -5    4    3   -6    1    1
E    0    2   -5    3    4   -5    0    1
F   -4   -5   -4   -6   -5    9   -5   -2
G    1    0   -3    1    0   -5    5   -2
H   -1    1   -3    1    1   -2   -2    6
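As an illustration of how such a table is used, the partial matrix above can be stored as a nested list and queried by residue letter. The values are copied from the table; the names `RESIDUES`, `PAM250_PARTIAL` and `similar_prot` are assumptions made for this sketch, and only the eight residues shown are supported.

```python
# Row and column order of the partial table.
RESIDUES = "ABCDEFGH"

# Integer scores copied from the partial PAM250 table.
PAM250_PARTIAL = [
    [ 2,  0, -2,  0,  0, -4,  1, -1],  # A
    [ 0,  0, -4,  3,  2, -5,  0,  1],  # B
    [-2, -4, 12, -5, -5, -4, -3, -3],  # C
    [ 0,  3, -5,  4,  3, -6,  1,  1],  # D
    [ 0,  2, -5,  3,  4, -5,  0,  1],  # E
    [-4, -5, -4, -6, -5,  9, -5, -2],  # F
    [ 1,  0, -3,  1,  0, -5,  5, -2],  # G
    [-1,  1, -3,  1,  1, -2, -2,  6],  # H
]

INDEX = {residue: i for i, residue in enumerate(RESIDUES)}

def similar_prot(a, b):
    """Look up the similarity score for a pair of residue letters."""
    return PAM250_PARTIAL[INDEX[a]][INDEX[b]]

print(similar_prot("D", "E"))  # 3: aspartate and glutamate score as similar
print(similar_prot("C", "F"))  # -4
```

The matrix is symmetric, so the order of the two arguments does not matter.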
Origin of PAM 250 matrix
  Take aligned set of closely related proteins
  For each position in the set, find the most common amino acid observed there
  Calculate the frequency with which each other amino acid is observed at that position
  Combine frequencies from all positions to give table showing frequencies for each amino acid changing to each other amino acid
  Take logarithm and normalize for frequency of each amino acid
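The recipe above can be illustrated with a toy calculation. This sketch follows only the simplified steps listed on the slide; the published procedure involves further steps (such as extrapolating to 250 PAMs) that are not shown here. The function name `log_odds_from_alignment`, the gap-free toy alignment, the choice of base-10 logarithms, and the decision to count the consensus residue itself as an "unchanged" pair are all assumptions.

```python
import math
from collections import Counter

def log_odds_from_alignment(aligned):
    """Toy log-odds scores from equal-length, gap-free aligned sequences."""
    pair_counts = Counter()     # (most common residue, observed residue) -> count
    residue_counts = Counter()  # residue -> count over the whole alignment
    for column in zip(*aligned):
        residue_counts.update(column)
        # Most common residue at this position (ties go to the first seen).
        consensus = Counter(column).most_common(1)[0][0]
        for residue in column:
            pair_counts[(consensus, residue)] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        # Normalize by how often the two residues occur overall.
        expected = (residue_counts[a] / total_residues) * (residue_counts[b] / total_residues)
        scores[(a, b)] = math.log10(observed / expected)
    return scores

toy_scores = log_odds_from_alignment(["AK", "AK", "AR", "AK"])
for pair, score in sorted(toy_scores.items()):
    print(pair, round(score, 2))
```

With so little data the numbers mean nothing biologically; the point is only the shape of the computation: count, combine across positions, divide by the expected frequency, take the logarithm.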
Sequence comparison with dot matrices
  Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)
  Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
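The basic method can be sketched as follows. The function names `dot_matrix` and `render` and the example sequences are illustrative assumptions; a dot is drawn as `*` and an empty cell as `.`.

```python
def dot_matrix(top, left):
    """One row per element of `left`, one column per element of `top`.

    A cell is True if and only if the two elements are the same.
    """
    return [[a == b for b in top] for a in left]

def render(grid, top, left):
    """Draw the grid as text with the sequences along the top and left."""
    lines = ["  " + top]
    for label, row in zip(left, grid):
        lines.append(label + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

top, left = "GATTACA", "GATCACA"
grid = dot_matrix(top, left)
print(render(grid, top, left))
```

In the printed grid the main diagonal is broken at the fourth position, where T and C differ, and scattered off-diagonal dots mark chance matches between identical letters.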
Sequence comparison with dot matrices - References
  W.M. Fitch. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9-16 (1966)
  W.M. Fitch. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem. Genet. 3:99-108 (1969)
  A.J. Gibbs & G.A. McIntyre. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16:1-11 (1970)
  A.D. McLachlan. Test for comparing related amino acid sequences: cytochrome c and cytochrome c551. J. Mol. Biol. 61:409-424 (1971)
  J. Pustell & F.C. Kafatos. A high speed, high capacity homology matrix: zooming through SV40 and polyoma. Nucleic Acids Res. 10:4765-4782 (1982)
  J. Pustell & F.C. Kafatos. A convenient and adaptable package of computer programs for DNA and protein sequence management, analysis and homology determination. Nucleic Acids Res. 12:643-655 (1984)
Examples for protein sequences
  (Demonstration A5, Sequence 1 vs. 2)
  (Demonstration A5, Sequence 2 vs. 3)
Interpretation of dot matrices
  Regions of similarity appear as diagonal runs of dots
  Reverse diagonals (perpendicular to diagonal) indicate inversions
  Reverse diagonals crossing diagonals (Xs) indicate palindromes
    (Demonstration A5, Sequence 4 vs. 4)
  Can link or "join" separate diagonals to form alignment with "gaps"
    Each a.a. or base can only be used once
      Can't trace vertically or horizontally
      Can't double back
    A gap is introduced by each vertical or horizontal skip
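The joining rules above can be checked mechanically. In this sketch a candidate alignment is a list of (row, column) dot positions in order; the name `count_gaps` is hypothetical, and counting one gap each time the path moves to a different diagonal is one interpretation of "each vertical or horizontal skip".

```python
def count_gaps(path):
    """Validate a path of (row, col) dots and count the gaps it implies.

    Each step must move strictly down and to the right, so no element is
    used twice and the path never doubles back.
    """
    gaps = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if r1 <= r0 or c1 <= c0:
            raise ValueError("path reuses an element or doubles back")
        # Staying on the same diagonal means rows and columns advance equally.
        if (r1 - r0) != (c1 - c0):
            gaps += 1
    return gaps

# Two diagonal runs joined by skipping one column: one gap.
print(count_gaps([(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)]))  # 1
```

A path such as (1, 1) followed by (1, 2) is rejected, because it would trace horizontally and use the element in row 1 twice.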
Uses for dot matrices
  Can use dot matrices to align two proteins or two nucleic acid sequences
  Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
    Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
    (Demonstration A5, Sequence 5 vs. 6)
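A rough sketch of the self-comparison idea: when a sequence is compared to itself, a repeat shows up as a run of matches on a diagonal offset from the main one. The function name `longest_run_by_offset` and the toy sequence are assumptions made for illustration.

```python
def longest_run_by_offset(seq):
    """For each diagonal offset k > 0 of the self-comparison matrix,
    return the longest run of matches seq[i] == seq[i + k]."""
    runs = {}
    for k in range(1, len(seq)):
        best = current = 0
        for i in range(len(seq) - k):
            current = current + 1 if seq[i] == seq[i + k] else 0
            best = max(best, current)
        runs[k] = best
    return runs

# A six-residue unit repeated twice gives a run of 6 at offset 6.
runs = longest_run_by_offset("ACDEFGACDEFG")
print(runs[6])  # 6
```

Only offsets above the main diagonal are scanned, because the self-comparison matrix is symmetric.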
Uses for dot matrices
  Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
  Excellent approach for finding sequence transpositions
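The self base-pairing use can be sketched by building the dot matrix of an RNA against its own reverse complement. The names `reverse_complement` and `self_pairing_matrix` are illustrative, and only Watson-Crick pairs are considered here (G-U wobble pairs are ignored, which is a simplification).

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Complement every base, reading the sequence backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def self_pairing_matrix(rna):
    """Dot matrix of rna (rows) against its reverse complement (columns).

    A dot at (i, j) means base i can pair with base len(rna) - 1 - j.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]

rna = "GGGAAAACCC"
print(reverse_complement(rna))  # GGGUUUUCCC
matrix = self_pairing_matrix(rna)
print([matrix[i][i] for i in range(len(rna))])
```

For this toy hairpin, the first three and last three cells of the main diagonal are dots: bases 0 to 2 can pair with bases 9 to 7, forming a stem around the unpaired A loop.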
Filtering to remove “noise”
  A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)
  Solution: use a window and a threshold
    compare character by character within a window (have to choose window size)
    require a certain fraction of matches within the window in order to display it with a “dot”
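A minimal sketch of window-and-threshold filtering, assuming an identity comparison and a window anchored at its first position. The function name `windowed_dot_matrix` and the parameter names are hypothetical; real programs may centre the window or use a scoring matrix instead.

```python
def windowed_dot_matrix(top, left, window, min_fraction):
    """Place a dot at (i, j) only if at least min_fraction of the
    `window` characters starting at left[i] and top[j] match."""
    rows = len(left) - window + 1
    cols = len(top) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches / window >= min_fraction)
        grid.append(row)
    return grid

# With a 7-character window and a 100% threshold, only the two full
# copies of the motif produce dots; isolated single matches disappear.
grid = windowed_dot_matrix("GATTACAGATTACA", "GATTACA", window=7, min_fraction=1.0)
print([j for j, hit in enumerate(grid[0]) if hit])  # [0, 7]
```

Lowering `min_fraction` lets partially matching windows through, which is the trade-off explored with the "min. % score" setting later in these slides.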
Example spreadsheet with window
  (Demonstration A6)
How do we choose a window size?
  Window size changes with goal of analysis
    size of average exon
    size of average protein structural element
    size of gene promoter
    size of enzyme active site
How do we choose a threshold value?
  Threshold based on statistics
    using shuffled actual sequence
      find average (m) and s.d. (σ) of match scores of shuffled sequence
      convert original (unshuffled) scores (x) to Z scores
        Z = (x - m)/σ
      use threshold Z of 3 to 6
    using analysis of other sets of sequences
      provides “objective” standard of significance
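The shuffling procedure can be sketched as below. The function names, the number of shuffles, the fixed random seed and the toy sequence are all assumptions made for illustration; the score here is simply the number of identical characters in a window.

```python
import random
import statistics

def window_scores(a, b, window):
    """Match counts for every pair of window start positions."""
    return [
        sum(a[i + k] == b[j + k] for k in range(window))
        for i in range(len(a) - window + 1)
        for j in range(len(b) - window + 1)
    ]

def shuffled_stats(a, b, window, trials=20, seed=0):
    """Mean (m) and s.d. of window scores of a against shuffled copies of b."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = list(b)
        rng.shuffle(shuffled)  # keeps composition, destroys order
        scores.extend(window_scores(a, "".join(shuffled), window))
    return statistics.mean(scores), statistics.pstdev(scores)

def z_score(x, m, sd):
    """Z = (x - m) / s.d."""
    return (x - m) / sd

seq = "ACGTTGCAAGCTTAGCCATGGATCCGTA"
m, sd = shuffled_stats(seq, seq, window=10)
print(round(m, 2), round(sd, 2), round(z_score(10, m, sd), 2))
```

A window would then be displayed as a dot only if its Z score exceeds the chosen threshold (3 to 6 on the slide).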
Displaying matrices by Pustell method with MacVector
  Goal: Determine differences in arrangements of elements of pBluescript family of vectors
  Starting point: Use sequences of three of the members of the family: open the first three files in the Common Vectors: Bluescript folder.
Dot matrices with MacVector
  From the Analyze menu select Pustell DNA matrix. A dialog appears.
  Select SYNBL2KSM and SYNBL2SKM. Use defaults for all else.
  23 regions of homology (“diagonals”) obtained. Request “Matrix map” only (don’t need “Aligned sequences”).
  Note the inversion near nucleotide 700 (the direction of the polylinker is reversed between the two vectors).
  [Figure: Pustell DNA matrix map for SYNBL2KSM. Window Size = 30; Scoring Matrix: DNA identity matrix; Min. % Score = 65; Hash Value = 6; Strand = Both; Jump = 1. Both axes are marked from 500 to 2500 in steps of 500.]
Dot matrices with MacVector
  To examine the effect of the threshold, decrease “min. % score” from 65 to 55.
  Now we get many (223) diagonals.
  Note presence of many short regions of at least 55% homology.
  [Figure: Pustell DNA matrix map for SYNBL2KSM. Window Size = 30; Scoring Matrix: DNA identity matrix; Min. % Score = 55; Hash Value = 6; Strand = Both; Jump = 1. Both axes are marked from 500 to 2500 in steps of 500.]
Dot matrices with MacVector
  Now increase threshold to 90%.
  Now just 3 diagonals are found.
  Note absence of short homologous regions (“noise”).
  [Figure: Pustell DNA matrix map for SYNBL2KSM. Window Size = 30; Scoring Matrix: DNA identity matrix; Min. % Score = 90; Hash Value = 6; Strand = Both; Jump = 1. Both axes are marked from 500 to 2500 in steps of 500.]
Dot matrices with MacVector
  Now compare SYNBL2KSP to SYNBL2SKM.
  22 diagonals found using default settings.
  Note second large inversion at one end of sequences.
  [Figure: Pustell DNA matrix map for SYNBL2KSP. Window Size = 30; Scoring Matrix: DNA identity matrix; Min. % Score = 65; Hash Value = 6; Strand = Both; Jump = 1. Both axes are marked from 500 to 2500 in steps of 500.]
More dot matrices with MacVector - DNA homology
  Goal: Duplicate Figure 6 of Chapter 3 of Sequence Analysis Primer
  Get Accession numbers J02289 (Polyoma) and J02400 (SV40) from Entrez
  Do Pustell DNA Matrix analysis using parameters similar to those used in text (window size = 41, % identity = 51)
  [Figure: Pustell DNA matrix map for J02289 Polyoma a3.MacV. Window Size = 41; Scoring Matrix: DNA identity matrix; Min. % Score = 51; Hash Value = 4; Strand = +-; Jump = 1. Both axes are marked from 1000 to 5000 in steps of 1000.]
More dot matrices with MacVector - protein homology
  Goal: Reproduce Figure 15 from Chapter 3 of Sequence Analysis Primer
  Get Accession numbers P17678 (chicken) and X17254 (human) erythroid transcription factors using Entrez
  Do Pustell Protein Matrix Analysis
Reading for next class
  B & O, Chapter 7, just pp. 145-155
  Additional optional reading: Sequence Analysis Primer, pp. 124-134, “Dynamic Programming Methods” (on web site as Reading 1)
  (03-510) Durbin et al., Sections 2.1 - 2.4
  Everybody: Look over paper by Needleman and Wunsch on web site (Reading 2)
Summary, Part 7
  Similarity functions or similarity matrices describe (quantitatively) the degree of similarity between two sequence elements (bases or amino acids)
  The Dayhoff MDM78 matrix is a similarity matrix commonly used to estimate the degree to which a change from one amino acid to another can be “tolerated” in a protein
  Dot matrices graphically present regions of identity or similarity between two sequences
  The use of windows and thresholds can reduce “noise” in dot matrices
  Inversions, duplications and palindromes have unique “signatures” in dot matrices