Computational Biology, Part 7
Similarity Functions and Sequence Comparison with Dot Matrices
Robert F. Murphy
Copyright 1996, 1999-2001. All rights reserved.


Similarity Functions

Used to facilitate comparison of two sequence elements
• logical valued (true or false, 1 or 0)
  • test whether first argument matches (or could match) second argument
• numerical valued
  • test degree to which first argument matches second

Logical valued similarity functions

Let Search(I) = 'A' and Sequence(J) = 'R'
• A Function to Test for Exact Match
  • MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
• A Function to Test for Possibility of a Match using IUB codes for Incompletely Specified Bases
  • MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G
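The two logical-valued functions can be sketched in Python as follows. This is a minimal illustration, not the course's own implementation: the names match_exact, match_wild and IUB_CODES are assumptions chosen to mirror MatchExact and MatchWild, and the table lists the standard IUB ambiguity codes for DNA.

```python
# IUB nucleotide codes mapped to the bases each code can stand for.
# (Table name and function names are illustrative assumptions.)
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_exact(a: str, b: str) -> bool:
    """True only if the two symbols are the same character."""
    return a == b


def match_wild(a: str, b: str) -> bool:
    """True if the two symbols share at least one possible base."""
    return bool(set(IUB_CODES[a]) & set(IUB_CODES[b]))


print(match_exact("A", "R"))  # False: A is not R
print(match_wild("A", "R"))   # True: R can be either A or G
```

Treating each code as a set of bases makes the wildcard test a simple set intersection, so it works in either argument order.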

Numerical valued similarity functions

• return value could be probability (for DNA)
  • Let Search(I) = 'A' and Sequence(J) = 'R'
  • SimilarNuc(Search(I),Sequence(J)) could return 0.5 since chances are 1 out of 2 that a purine is adenine
• return value could be similarity (for protein)
  • Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
  • SimilarProt(Seq1(I),Seq2(J)) could return 0.8 since lysine is similar to arginine
• usually use integer values for efficiency
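One way a probability-valued function for DNA could be written is sketched below. It assumes that every base allowed by an ambiguity code is equally likely, which is a simplification; the name similar_nuc and the IUB_CODES table are illustrative, not taken from the course software.

```python
# IUB nucleotide codes mapped to the bases each code can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def similar_nuc(a: str, b: str) -> float:
    """Probability that bases drawn uniformly from each code are identical.

    Assumption: all bases permitted by a code are equally likely.
    """
    set_a, set_b = set(IUB_CODES[a]), set(IUB_CODES[b])
    return len(set_a & set_b) / (len(set_a) * len(set_b))


print(similar_nuc("A", "R"))  # 0.5: 1 chance in 2 that a purine is adenine
```

In practice such values are often scaled and rounded to integers, in line with the note above about efficiency.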

Scoring (similarity) matrices

• For each pair of characters in alphabet, value is proportional to degree of similarity (or other scoring criterion) between them
• For proteins, most frequently used is Mutation Data Matrix from Dayhoff, 1978 (MDM78)

Dayhoff PAM250 similarity matrix (partial)

      A    B    C    D    E    F    G    H
A     2    0   -2    0    0   -4    1   -1
B     0    0   -4    3    2   -5    0    1
C    -2   -4   12   -5   -5   -4   -3   -3
D     0    3   -5    4    3   -6    1    1
E     0    2   -5    3    4   -5    0    1
F    -4   -5   -4   -6   -5    9   -5   -2
G     1    0   -3    1    0   -5    5   -2
H    -1    1   -3    1    1   -2   -2    6
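A scoring matrix is typically stored as a lookup table indexed by the pair of symbols. The sketch below loads only the partial PAM250 values shown above (rows and columns A through H); the names LETTERS, ROWS, PAM250_PARTIAL and similar_prot are illustrative assumptions.

```python
# Partial PAM250 values exactly as tabulated above (A..H only).
LETTERS = "ABCDEFGH"
ROWS = [
    [2, 0, -2, 0, 0, -4, 1, -1],      # A
    [0, 0, -4, 3, 2, -5, 0, 1],       # B
    [-2, -4, 12, -5, -5, -4, -3, -3], # C
    [0, 3, -5, 4, 3, -6, 1, 1],       # D
    [0, 2, -5, 3, 4, -5, 0, 1],       # E
    [-4, -5, -4, -6, -5, 9, -5, -2],  # F
    [1, 0, -3, 1, 0, -5, 5, -2],      # G
    [-1, 1, -3, 1, 1, -2, -2, 6],     # H
]

# Dictionary keyed by (row symbol, column symbol).
PAM250_PARTIAL = {
    (r, c): ROWS[i][j]
    for i, r in enumerate(LETTERS)
    for j, c in enumerate(LETTERS)
}


def similar_prot(a: str, b: str) -> int:
    """Integer similarity score for two symbols in the partial table."""
    return PAM250_PARTIAL[(a, b)]


print(similar_prot("C", "C"))  # 12
print(similar_prot("A", "F"))  # -4
```

Symbols outside A through H raise a KeyError here, since only the partial table is available.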

Origin of PAM 250 matrix

• Take aligned set of closely related proteins
• For each position in the set, find the most common amino acid observed there
• Calculate the frequency with which each other amino acid is observed at that position
• Combine frequencies from all positions to give table showing frequencies for each amino acid changing to each other amino acid
• Take logarithm and normalize for frequency of each amino acid
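The final step, taking a logarithm and normalizing for amino acid frequency, can be illustrated with a log-odds calculation. The sketch below is an assumption-laden illustration rather than Dayhoff's exact procedure: the function name log_odds_score, the 10 x log10 scaling, and the example frequencies are all hypothetical choices made for this example.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 10.0) -> int:
    """Rounded log-odds score for one amino acid pair.

    pair_freq: observed frequency of the pair in the alignments
    freq_a, freq_b: background frequencies of the two amino acids
    scale: multiplier applied before rounding (assumed here to be 10)
    """
    expected = freq_a * freq_b  # frequency expected by chance alone
    return round(scale * math.log10(pair_freq / expected))


# Hypothetical numbers: a pair seen twice as often as chance scores
# positive, a pair seen half as often as chance scores negative.
print(log_odds_score(0.02, 0.1, 0.1))
print(log_odds_score(0.005, 0.1, 0.1))
```

Positive scores mark substitutions seen more often than chance would predict, and negative scores mark those seen less often.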

Sequence comparison with dot matrices

Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Sequence comparison with dot matrices

Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
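The basic method translates almost directly into code. The following minimal sketch builds the grid as a list of rows of booleans and renders it as text; the function names dot_matrix and render and the example sequences are illustrative assumptions.

```python
def dot_matrix(seq_top: str, seq_left: str) -> list[list[bool]]:
    """Grid with one row per element of seq_left and one column per
    element of seq_top; True marks positions where the two are the same."""
    return [[a == b for b in seq_top] for a in seq_left]


def render(matrix: list[list[bool]], seq_top: str, seq_left: str) -> str:
    """Plain-text picture: '*' for a dot, space for no dot."""
    lines = ["  " + seq_top]
    for ch, row in zip(seq_left, matrix):
        lines.append(ch + " " + "".join("*" if hit else " " for hit in row))
    return "\n".join(lines)


top, left = "GATTACA", "GATCACA"
print(render(dot_matrix(top, left), top, left))
```

Comparing a sequence with itself always gives an unbroken main diagonal, which is a quick sanity check for any implementation.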

Sequence comparison with dot matrices - References

• W.M. Fitch. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9-16 (1966)
• W.M. Fitch. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem. Genet. 3:99-108 (1969)
• A.J. Gibbs & G.A. McIntyre. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16:1-11 (1970)
• A.D. McLachlan. Test for comparing related amino acid sequences: cytochrome c and cytochrome c551. J. Mol. Biol. 61:409-424 (1971)
• J. Pustell & F.C. Kafatos. A high speed, high capacity homology matrix: zooming through SV40 and polyoma. Nucleic Acids Res. 10:4765-4782 (1982)
• J. Pustell & F.C. Kafatos. A convenient and adaptable package of computer programs for DNA and protein sequence management, analysis and homology determination. Nucleic Acids Res. 12:643-655 (1984)

Examples for protein sequences

(Demonstration A5, Sequence 1 vs. 2)
(Demonstration A5, Sequence 2 vs. 3)

Interpretation of dot matrices

• Regions of similarity appear as diagonal runs of dots
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• Reverse diagonals crossing diagonals (Xs) indicate palindromes
(Demonstration A5, Sequence 4 vs. 4)

Interpretation of dot matrices

• Can link or "join" separate diagonals to form alignment with "gaps"
  • Each a.a. or base can only be used once
    • Can't trace vertically or horizontally
    • Can't double back
  • A gap is introduced by each vertical or horizontal skip

Uses for dot matrices

• Can use dot matrices to align two proteins or two nucleic acid sequences
• Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  • Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
(Demonstration A5, Sequence 5 vs. 6)

Uses for dot matrices

• Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
• Excellent approach for finding sequence transpositions
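The RNA self base-pairing use can be sketched by comparing a sequence against its own reverse complement. This illustration assumes Watson-Crick pairs only (no G-U wobble pairs), and the names COMPLEMENT, reverse_complement and self_pairing_matrix are hypothetical.

```python
# Watson-Crick RNA complements only; G-U wobble pairs are ignored here.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """The sequence complemented and reversed."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def self_pairing_matrix(rna: str) -> list[list[bool]]:
    """Dot matrix of rna (rows) against its reverse complement (columns).

    A dot at row i, column j means base i can pair with base
    len(rna) - 1 - j, so a diagonal run of dots marks a potential stem.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]


hairpin = "GGGAAACCC"  # toy example: a 3-base stem around an AAA loop
for row in self_pairing_matrix(hairpin):
    print("".join("*" if hit else "." for hit in row))
```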

Filtering to remove “noise”

• A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A)
• Solution: use a window and a threshold
  • compare character by character within a window (have to choose window size)
  • require certain fraction of matches within window in order to display it with a “dot”
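The window-and-threshold filter can be sketched as follows. This is a minimal version under stated assumptions: each dot is anchored at the start of its window, scoring is by simple identity, and the name windowed_dot_matrix and its default parameter values are hypothetical.

```python
def windowed_dot_matrix(seq_top: str, seq_left: str,
                        window: int = 3,
                        min_fraction: float = 1.0) -> list[list[bool]]:
    """Dot matrix filtered with a sliding window.

    Entry [i][j] is True when at least min_fraction of the window
    positions starting at seq_left[i] and seq_top[j] are identical.
    """
    rows = max(len(seq_left) - window + 1, 0)
    cols = max(len(seq_top) - window + 1, 0)
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_left[i + k] == seq_top[j + k]
                          for k in range(window))
            row.append(matches / window >= min_fraction)
        dots.append(row)
    return dots


seq = "ACGTACGT"
for row in windowed_dot_matrix(seq, seq, window=4, min_fraction=1.0):
    print("".join("*" if hit else "." for hit in row))
```

With window=1 and min_fraction=1.0 this reduces to the unfiltered dot matrix, so the two parameters control how much isolated matching is suppressed.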

Example spreadsheet with window

(Demonstration A6)

How do we choose a window size?

Window size changes with goal of analysis
• size of average exon
• size of average protein structural element
• size of gene promoter
• size of enzyme active site

How do we choose a threshold value?

Threshold based on statistics
• using shuffled actual sequence
  • find average (m) and s.d. (σ) of match scores of shuffled sequence
  • convert original (unshuffled) scores (x) to Z scores
    • Z = (x - m)/σ
  • use threshold Z of 3 to 6
• using analysis of other sets of sequences
  • provides “objective” standard of significance
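The shuffling approach can be sketched as below. This is an illustration under assumptions, not the procedure used by any particular program: the score is a simple count of identical positions, and the names match_score and z_score, the number of shuffles, and the fixed random seed are all hypothetical choices.

```python
import random
import statistics


def match_score(a, b) -> int:
    """Number of positions at which the two sequences are identical."""
    return sum(x == y for x, y in zip(a, b))


def z_score(seq1: str, seq2: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Z = (x - m) / sd, where x is the unshuffled score and m and sd are
    the mean and standard deviation of scores against shuffled seq2."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    x = match_score(seq1, seq2)
    letters = list(seq2)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        shuffled_scores.append(match_score(seq1, letters))
    m = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z is undefined")
    return (x - m) / sd


seq = "ACGTTGCAAGCTTAGCCGATACGGTACA"
print(z_score(seq, seq) > 3)  # an identical pair should clear a threshold of 3
```

Shuffling preserves base composition, so the resulting Z score reflects the order of the sequence rather than its overall makeup.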

Displaying matrices by Pustell method with MacVector

• Goal: Determine differences in arrangements of elements of pBluescript family of vectors
• Starting point: Use sequences of three of the members of the family: open the first three files in the Common Vectors: Bluescript folder.

Dot matrices with MacVector

• From Analyze menu select Pustell DNA matrix. Dialog appears.
• Select SYNBL2KSM and SYNBL2SKM. Use defaults for all else.
• 23 regions of homology (“diagonals”) obtained. Request “Matrix map” only (don’t need “Aligned sequences”)

Dot matrices with MacVector

Note inversion near nucleotide 700 (the direction of the polylinker is reversed between the two vectors)

[Figure: matrix map, axis labeled SYNBL2KSM, axis ticks 500 to 2500. Window Size = 30; Scoring Matrix: DNA identity matrix; Min. % Score = 65; Hash Value = 6; Strand = Both; Jump = 1]


To examine effect of threshold, decrease “min. % score” from 65 to 55

Now we get many (223) diagonals.

Dot matrices with MacVector

Note presence of many short regions of at least 55% homology.

[Figure: matrix map, axis labeled SYNBL2KSM, axis ticks 500 to 2500]


Now increase threshold to 90%.

Now just 3 diagonals are found.

Dot matrices with MacVector

Note absence of short homologous regions (“noise”).

[Figure: matrix map, axis labeled SYNBL2KSM, axis ticks 500 to 2500]


Now compare SYNBL2KSP to SYNBL2SKM.

22 diagonals found using default settings.

Dot matrices with MacVector

Note second large inversion at one end of sequences.

[Figure: matrix map, axis labeled SYNBL2KSP, axis ticks 500 to 2500]

More dot matrices with MacVector - DNA homology

• Goal: Duplicate Figure 6 of Chapter 3 of Sequence Analysis Primer
• Get Accession numbers J02289 (Polyoma) and J02400 (SV40) from Entrez
• Do Pustell DNA Matrix analysis using parameters similar to those used in text (window size = 41, %identity = 51)

More dot matrices with MacVector - DNA homology

[Figure: matrix map, axis labeled J02289 Polyoma a3.MacV, axis ticks 1000 to 5000. Strand = +-; Jump = 1]

More dot matrices with MacVector - protein homology

• Goal: Reproduce Figure 15 from Chapter 3 of Sequence Analysis Primer
• Get Accession numbers P17678 (Chicken) and X17254 (human) erythroid transcription factors using Entrez
• Do Pustell Protein Matrix Analysis

Reading for next class

• B & O, Chapter 7 just pp. 145-155
• Additional optional reading: Sequence Analysis Primer, pp. 124-134 “Dynamic Programming Methods” (on web site as Reading 1)
• (03-510) Durbin et al, Sections 2.1 - 2.4
• Everybody: Look over paper by Needleman and Wunsch on web site (Reading 2)

Summary, Part 7

• Similarity functions or similarity matrices describe (quantitatively) the degree of similarity between two sequence elements (bases or amino acids)
• The Dayhoff MDM78 matrix is a similarity matrix commonly used to estimate the degree to which a change from one amino acid to another can be “tolerated” in a protein
• Dot matrices graphically present regions of identity or similarity between two sequences
• The use of windows and thresholds can reduce “noise” in dot matrices
• Inversions, duplications and palindromes have unique “signatures” in dot matrices
